Elastic Stack Cross-Cluster Search

Thumbnail image adapted from "cell division sequence" by Leo Reynolds, licensed under CC BY-NC-SA 2.0

Summary

In this tutorial we will describe what cross-cluster search is in the Elastic Stack, why you would want to use it, and how to set it up.


Cross-cluster search

First off, what is an Elastic cluster?

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes.

Second, what is cross-cluster search?

Cross-cluster search lets you run a single search request against one or more remote clusters. For example, you can use a cross-cluster search to filter and analyze log data stored on clusters in different data centers.
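
To make that concrete, the request syntax simply prefixes remote indices with the remote cluster’s alias. Here is a minimal sketch using hypothetical index and cluster names:

GET /logs-*,cluster_two:logs-*/_search
{
  "query": {
    "match_all": {}
  }
}

This single request searches the local logs-* indices as well as the logs-* indices on the remote cluster registered under the alias cluster_two.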

Third, why would you want to bother searching across different clusters?

Let’s pretend I have two Elastic clusters. The first runs on newer hardware with fast processors, lots of RAM, and a decent amount of very fast SSD storage. A cluster like that is going to perform great, but it won’t come cheap, so I may want to reserve it for important data such as security logs or performance metrics for mission-critical systems. My second cluster is built from older hardware that isn’t as fast but contains lots and lots of cheap HDD storage, which suits high-volume data such as netflow traffic or DNS query logs that I want to retain for longer periods of time.

Instead of smooshing all that data into one monolithic cluster (at great expense) I can split it across multiple clusters, optimize each for a particular data type or use case, and still search everything from one instance of Kibana.

Here is a 5-minute video explaining CCS at a high level.


Below is another, slightly longer video discussing not only CCS but also more general sizing and performance considerations for your clusters. I would highly recommend watching the video in its entirety, but if you are just interested in CCS then skip to 39:40 in the presentation.



Setting Up The Lab

To start we will set up two Elastic stacks, insert a Logstash server in front of them to direct traffic flows, link one of them as a remote cluster, and configure a Winlogbeat endpoint and a Packetbeat endpoint, each feeding its logs into a separate cluster. To make things easier I used the lab environment that I had previously set up in my tutorials on building an Elastic SIEM home lab and simply cloned it (with some tweaks to the configurations) to get the second stack.

If you are just getting started with Elastic then see Part 1 of that series for a quick and easy walkthrough on getting a stack set up; after you have both stacks built then you are ready to follow along with the rest of this tutorial.

When we are done, our lab will look something like this: two Beats agents (Winlogbeat and Packetbeat) feeding a single Logstash server, which routes each agent's events to its own Elastic cluster, with the second cluster configured as a remote of the first.


Logstash

What is Logstash?

Logstash is an open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice.

How it works

The Logstash event processing pipeline has three stages: inputs → filters → outputs. Inputs generate events, filters modify them, and outputs ship them elsewhere. Inputs and outputs support codecs that enable you to encode or decode the data as it enters or exits the pipeline without having to use a separate filter.
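
As a minimal sketch of those three stages, here is a pipeline that reads from a Beats input, tags each event with a mutate filter, and prints the result to stdout using the rubydebug codec (the port and tag are arbitrary choices for illustration):

input {
  beats {
    port => 5044
  }
}

filter {
  mutate {
    add_tag => ["lab"]
  }
}

output {
  stdout {
    codec => rubydebug
  }
}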

This processing pipeline is important because it allows us to take traffic flows from different sources and split them off towards separate destinations.

Installation

I am starting off with a fresh install of Ubuntu Server 20.04, fully updated and ready to go. The official install instructions for Logstash are pretty good, so I will follow those using the package repo method via APT:

Download and install the Public Signing Key:

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

Install the apt-transport-https package

sudo apt-get install apt-transport-https

Save the repository definition to /etc/apt/sources.list.d/elastic-7.x.list:

echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

Update the package index and install Logstash:

sudo apt-get update && sudo apt-get install logstash

Notice that the requisite instance of OpenJDK was automatically installed along with Logstash as part of the package repo process. You do have the option of skipping the repo and installing from a stand-alone binary but then you are responsible for installing Java separately.

One other (optional) step is to configure Logstash to start at system boot:

sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable logstash.service

Configuration

The two most important files are the pipeline config file located in /etc/logstash/conf.d/ and the main settings file, /etc/logstash/logstash.yml. Because this is such a basic setup we will not need to worry about the settings file in our lab but be aware that it contains important security settings that would be necessary in a production environment.

By default the conf.d directory is empty and it’s expected that you will populate it with your own .conf file. Fortunately Elastic includes an example file in the main Logstash directory (/etc/logstash/logstash-sample.conf) that we can copy and modify to fit our needs.

If we view the file we see a very basic pipeline that takes Beats input and outputs to a single instance of Elasticsearch.
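
It should look something like this (a sketch of the 7.x sample file; check your own copy for the exact contents):

# Sample Logstash configuration for creating a simple
# Beats -> Logstash -> Elasticsearch pipeline.

input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    #user => "elastic"
    #password => "changeme"
  }
}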

Next we are going to tweak the file, configuring it to divert the output based on the source, and then copy the file to the conf.d folder. Keep in mind that if the Logstash service is currently running it will attempt to read the pipeline file and immediately execute on it.

Notice the ‘if’ and ‘else if’ conditional statements in the output section; this is telling Logstash that if an input matches a specific condition then forward it to one destination, otherwise forward it to the other. You can read the docs regarding event data and fields to get a more in-depth explanation. You can also read this Elastic blog post that details another scenario using conditionals to help give some context around this feature and how it can be useful.
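
Here is a sketch of my modified pipeline. The [agent][type] field is populated by each Beats agent with its own name, which makes it a convenient routing key; the cluster #1 address (10.0.2.11) is a placeholder from my lab, while 10.0.2.12 is cluster #2, so adjust both to match your environment:

input {
  beats {
    port => 5044
  }
}

output {
  if [agent][type] == "winlogbeat" {
    # Windows event logs go to cluster #1
    elasticsearch {
      hosts => ["http://10.0.2.11:9200"]
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
  } else if [agent][type] == "packetbeat" {
    # Network traffic data goes to cluster #2
    elasticsearch {
      hosts => ["http://10.0.2.12:9200"]
      index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    }
  }
}

Once the file looks right, copy it into place with something like sudo cp logstash-sample.conf /etc/logstash/conf.d/.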

My pipeline file is now in place at /etc/logstash/conf.d/logstash-sample.conf and, though it is not strictly necessary, I went ahead and restarted the Logstash service. Also, note that the name of the file doesn’t matter as long as it ends in “.conf”.

You can check the Logstash event logs located in /var/log/logstash/logstash-plain.log and see not only the successful connection to each of the clusters but also the Beats listener starting up on port 5044 per the input settings in the pipeline file.
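
An easy way to watch for those messages is to follow the log as the service starts up:

sudo tail -f /var/log/logstash/logstash-plain.log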


Beats Agent Setup

Recall from the diagram that we intend for the output from a Winlogbeat agent to feed into cluster #1 and the output from a Packetbeat agent to feed into cluster #2.

Winlogbeat

For instructions on setting up Winlogbeat you can reference the official Elastic documentation or you can check out part 2 of my Elastic SIEM home lab series where amongst other things I set up and configure a Winlogbeat agent.

On the host running Winlogbeat I ran WordPad as Administrator (it is important to use WordPad because it preserves the formatting of the YAML file whereas Notepad does not) and made the necessary config changes to redirect its output from Elasticsearch to Logstash.
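
The relevant portion of winlogbeat.yml ends up looking something like this; the Logstash address (10.0.2.10) is a placeholder for my lab's Logstash server, so substitute your own:

# Comment out the default Elasticsearch output...
#output.elasticsearch:
#  hosts: ["localhost:9200"]

# ...and point the agent at Logstash instead
output.logstash:
  hosts: ["10.0.2.10:5044"]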

Don’t forget to save the file and then restart the service.
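
Assuming the default service name, an elevated PowerShell prompt should do it:

Restart-Service winlogbeat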

Moving over to the first Elastic cluster I can see data successfully being ingested which means I have a complete path from Winlogbeat -> Logstash -> Elasticsearch.

Packetbeat

For the Packetbeat agent I used a vanilla instance of Ubuntu Server 20.04, mostly because I already had one available, but keep in mind that the agent runs on all the major OS types.

Again, following the install documentation using APT and the repos.

Download and install the Public Signing Key

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

Install the apt-transport-https package

sudo apt-get install apt-transport-https

Save the repository definition to /etc/apt/sources.list.d/elastic-7.x.list

echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

Install Packetbeat

sudo apt-get update && sudo apt-get install packetbeat

Configure it to run at boot

sudo systemctl enable packetbeat

(Optional, though recommended) Set up the Packetbeat dashboards in cluster #2

Edit the /etc/packetbeat/packetbeat.yml config file, plug in the connection settings for both Kibana and the Elastic cluster, save, and run the Packetbeat setup command.
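
The relevant packetbeat.yml settings look something like the following; the addresses assume cluster #2 is at 10.0.2.12 with Kibana on its default port, so adjust for your lab:

setup.kibana:
  host: "10.0.2.12:5601"

output.elasticsearch:
  hosts: ["10.0.2.12:9200"]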

packetbeat setup --dashboards

Finally, go back and edit the /etc/packetbeat/packetbeat.yml file, comment out the Elasticsearch output, point it to Logstash…
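
Again a sketch, with 10.0.2.10 standing in for your Logstash server:

# Comment out the direct Elasticsearch output...
#output.elasticsearch:
#  hosts: ["10.0.2.12:9200"]

# ...and send events to Logstash instead
output.logstash:
  hosts: ["10.0.2.10:5044"]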

…and restart the packetbeat agent.
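
On Ubuntu that is simply:

sudo systemctl restart packetbeat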

Moving over to the second Elastic cluster I can see data successfully being ingested which means I have a complete path from Packetbeat -> Logstash -> Elasticsearch.


Remote Clusters

The first step in setting up CCS is to configure at least one of our clusters to be a remote cluster.

The easiest way to do this is via the ‘cluster update settings API’ which you can interact with using the Dev Tools in the Kibana Management section on the first (i.e. “local”) cluster:

Paste the following code into the editor (making any necessary config adjustments for your specific implementation); note that in 7.x the remote cluster settings live under the cluster.remote namespace, as the older search.remote settings were removed. You can see that the remote cluster

  1. Is persistent

  2. Has an arbitrary name of “cluster_two” (feel free to name yours whatever you wish)

  3. Has one node with the IP address of 10.0.2.12:9300

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": {
          "seeds": ["10.0.2.12:9300"]
        }
      }
    }
  }
}
GET _remote/info

You can see on the right side the successful results of the command.

Beneath that you also see a command to verify the status of the remote cluster.

The final piece of the puzzle is to create a new index pattern that matches the Packetbeat index located on the remote cluster.

Click on the menu icon in the upper left corner -> Management -> Stack Management

Next go to Index Patterns -> Create index pattern

Type in the name of the remote cluster followed by a colon and the index pattern to match (in this case cluster_two:packetbeat-*). You should see a message indicating one or more indexes match your pattern.

You can choose whatever time field you like from the drop-down but I usually go with ‘@timestamp’ as it’s well suited to the time-series data that I tend to ingest. When you are done click the ‘Create index pattern’ button in the bottom right.

Browse back to Analytics -> Discover, select your new index pattern, and pat yourself on the back :)

In order for your remote cluster logs to show up in the Security app you need to go back to Management -> Advanced Settings -> Security Solution -> Elasticsearch Indices and add the new index pattern you created (cluster_two:packetbeat-* in this example).

Make sure to save your changes, then browse to Security -> Overview; at the bottom you should see the Event sources reflecting the logs from the remote cluster.