The total cost of operating Elasticsearch is a major pain point for many organizations, with causes that range from incorrect heap sizing to the wrong type of hardware. Learn how you can optimize your Elasticsearch costs.

Elasticsearch is a top contender for document search and log analytics. Many of the default settings are appropriate for basic workloads, but as your data scales, these defaults may begin to burn a hole in your pocket. Add to that instance types, storage mechanisms, memory management and a bunch more, and you could be looking at very expensive cluster costs.

1. Correct Java Heap Size

Elasticsearch runs on Java. In the JVM, the heap is the area of memory in which the application allocates its objects. Elasticsearch depends heavily on having enough of this memory to rapidly access the data it is querying.

While there are many significant data structures that are running on the disk, Elasticsearch still relies heavily on the heap for data relating to indices and various pointers to existing data. This relationship between the heap and the disk is a fundamental reason why Elasticsearch runs as efficiently as it does.

If you’ve left the heap size at 1GB, the default in older Elasticsearch versions, it is very likely that your Elasticsearch nodes are not making use of all of the available memory on their virtual machines. This wasted RAM directly translates into wasted money and, worse, poor performance from the cluster.

Choosing the correct heap size is time-consuming but worth doing. The performance benefits alone make it a compelling way to improve your cluster, and making better use of the capacity you already pay for on your virtual machines lowers your Elasticsearch costs further.

A word of warning on the heap size

There are limits to how large you can make the heap. Elastic recommends allocating no more than 50% of your available memory to the heap, so that the other 50% remains available for off-heap memory and OS-level caches. Keep two further points in mind:

  1. When you increase the size of your heap, you also increase the impact of garbage collection in the JVM.
  2. Any adjustment to your heap size should be immediately followed by detailed monitoring and benchmarking of your JVM GC pauses, to better understand the impact on performance.
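
As a rough illustration of the sizing rule above, here is a minimal Python sketch. The function names are hypothetical, and the 31GB cap is an assumption: it sits just under the commonly cited 32GB threshold for compressed object pointers in the JVM, and the exact threshold varies by system. Treat the output as a starting point for benchmarking, not a final answer.

```python
def recommended_heap_gb(ram_gb: float, fraction: float = 0.5, cap_gb: float = 31.0) -> int:
    """Suggest a heap size in whole gigabytes for a node with ram_gb of RAM.

    fraction: share of RAM given to the heap (50% by default).
    cap_gb:   assumed upper bound, chosen to stay under roughly 32GB.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    # Take the chosen share of RAM, but never exceed the cap.
    heap = min(ram_gb * fraction, cap_gb)
    # Round down to a whole gigabyte, with a floor of 1GB.
    return max(1, int(heap))


def jvm_options(heap_gb: int) -> str:
    """Render matching -Xms/-Xmx lines; minimum and maximum are set equal."""
    return f"-Xms{heap_gb}g\n-Xmx{heap_gb}g"


if __name__ == "__main__":
    for ram in (8, 64, 128):
        print(f"{ram}GB RAM -> {recommended_heap_gb(ram)}GB heap")
```

The two lines returned by `jvm_options` are standard JVM flags and would typically go into a JVM options file for the node. Setting both to the same value stops the JVM from resizing the heap at runtime.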

2. Avoid Oversized Dedicated Master Nodes

When you’re running a production-grade Elasticsearch cluster, you may wish to assign roles to your nodes. These roles give different nodes different responsibilities:

  • Data nodes store the data, and are in charge of ingesting and optimizing it, as well as satisfying queries.
  • Master Nodes are responsible for coordinating cluster-wide activities, like assigning shards to nodes.
  • Ingest Nodes process incoming documents before they are indexed.
  • Machine Learning Nodes execute Machine Learning jobs (an X-Pack feature), and so on.

If you’ve taken the step to separate nodes by specialized roles, which is recommended for any production-grade cluster above 20 nodes, you may have fallen into a common trap. When you set these roles in your cluster, a step that many people miss is choosing a different instance type for each role, so that each node’s hardware matches its specialist responsibility.

The result is master nodes, which only handle cluster-wide administrative tasks, running with far more memory than they need and driving up your Elasticsearch costs. While data nodes are memory heavy, master nodes are CPU bound and mainly need to be highly available.

In your cloud provider, look for instance types that have more CPU than memory, to ensure that more of your resources are effectively utilized by the cluster. You can avoid attaching costly disks as well. Just make sure the node types you choose for master nodes have enough memory to hold the cluster metadata - the larger the cluster, the more memory you’ll need. 

You might be able to allocate more memory to the heap on your master nodes.

If you have small master nodes, because your cluster isn’t particularly large, you can allocate a larger percentage of the node’s physical RAM to the heap. Rather than the standard 50%, you can allocate up to 75% on your master nodes. This is because the master node itself isn’t holding any data.

However, as your cluster grows, you may find that the memory requirements of your master nodes increase. In larger clusters, this approach can cause performance and operational complications, so be careful when you turn this dial. Fortunately, thanks to the innate scalability of the cloud, resizing master nodes is trivial.

3. Get Data Retention Right

As your data increases, it won’t be long before your cluster is holding onto a great deal of information. A difficult but crucial question to ask is this: how much of this data do we truly need? Holding onto everything ensures you never miss out on data, but after a while, your memory and storage are being consumed by documents that are never queried.

There are multiple things to consider if you’re going to implement retention policies:

  • Your indexing strategy needs to make it possible to drop whole indices at a time, rather than deleting individual documents en masse. Elasticsearch can easily drop a single index, but it may struggle with processing a high volume of individual delete requests.
  • You need to capture metrics on how often different data is queried over time, so you have a data-driven understanding of which documents are used.
  • You need to decide whether to push the archived documents out into cheap storage, such as AWS S3, so they’re always available if you need them.
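
To make the first point concrete, here is a minimal sketch of how time-based index names let you pick out whole indices to drop. It assumes daily indices named with a hypothetical `logs-YYYY.MM.DD` pattern. The function name and the sample data are illustrative only, and nothing here talks to a real cluster.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, retention_days, prefix="logs-"):
    """Return the daily indices whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # belongs to another dataset; leave it alone
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, so we cannot judge its age
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["logs-2024.01.01", "logs-2024.01.20", "logs-2024.02.01", "metrics-2024.01.01"]
    print(expired_indices(names, today=date(2024, 2, 5), retention_days=30))
```

Each name returned can then be removed with a single index deletion, which is far cheaper for the cluster than deleting the same documents one by one.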

These questions are specific to the product you’re building, but either way, there is a simple solution to handle this problem for you.

SLM to the rescue

Snapshot Lifecycle Management (SLM) is a feature in Elasticsearch that takes regular snapshots of your cluster on a schedule. These snapshots are stored in a snapshot repository, can be restored whenever you need them, and can be deleted automatically once they pass a retention period. Once an index has been captured in a snapshot, it can be dropped from the cluster, which gives you a sliding window of documents and prevents cluster storage from getting out of control.

And you don’t need to delete either!

If you don’t want to delete your snapshots after a certain amount of time, or if you want to move them entirely out of your cluster immediately, you can send them to S3 instead by registering a snapshot repository backed by an S3 bucket. This allows you to make use of cheap storage but retain historical data in the event of an audit or to maintain regulatory compliance.
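
As a sketch of what this looks like in practice, the Python below builds the JSON bodies for registering an S3-backed repository and for an SLM policy that snapshots into it. The helper names, the repository name `my_s3_repo`, the bucket name and the retention numbers are all assumptions for illustration. The field names follow the snapshot repository and SLM policy APIs, but check them against the documentation for your Elasticsearch version before relying on them.

```python
import json


def s3_repository_body(bucket: str) -> dict:
    """Body for PUT _snapshot/<repository_name> using the S3 repository type."""
    return {"type": "s3", "settings": {"bucket": bucket}}


def slm_policy_body(repository: str, indices: list, expire_after: str = "30d") -> dict:
    """Body for PUT _slm/policy/<policy_id>: nightly snapshots with retention."""
    return {
        "schedule": "0 30 1 * * ?",        # cron expression: every day at 01:30
        "name": "<nightly-snap-{now/d}>",   # date math puts the day in the snapshot name
        "repository": repository,
        "config": {"indices": indices},
        "retention": {
            "expire_after": expire_after,   # delete snapshots older than this
            "min_count": 5,                 # but always keep at least this many
            "max_count": 50,                # and never keep more than this many
        },
    }


if __name__ == "__main__":
    print(json.dumps(s3_repository_body("my-snapshot-bucket"), indent=2))
    print(json.dumps(slm_policy_body("my_s3_repo", ["logs-*"]), indent=2))
```

If you would rather keep snapshots indefinitely, the `retention` block can be left out of the policy.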

And the data can still be queried if you need it

If you’re running on AWS, you can still read these documents by using UltraWarm, a storage tier in Amazon OpenSearch Service that keeps index data in S3. If you’re not on AWS, you can make use of Elastic’s searchable snapshots instead. Either way, you can store documents cost-effectively without introducing a complex reingestion process to make use of the data again. The drawback is that these queries can be quite slow, but that’s to be expected: you’re making a conscious choice to trade performance for cost.

4. Don't Keep Everything in “Hot” Storage

If your cluster is sufficiently large and all of that data is being queried, even infrequently, you may want to take more fine-grained control of the type of storage you maintain to keep your Elasticsearch costs down. Elasticsearch allows you to set up hot, warm, and cold storage within your cluster. This allows you to keep information readily available, so you don’t need to restore snapshots from external sources, but it also allows you to dedicate resources to only the documents that are in demand.

What does hot, warm, and cold storage look like in practice?

Your hot storage may be virtual machines with huge pools of available memory and local SSD storage. This enables them to index large volumes of information and to read from and write to disk very quickly. Your warm storage may have a little less memory and slower SSDs. Your cold storage could even run on spinning disks, with smaller pools of memory.

Logs Example: If you retain all of your logs for a month, you may find that logs that were written yesterday are queried far more often than logs that were written 3 weeks ago. In this case, it doesn’t make sense to hold them in the same type of storage, but you also don’t want to lose them from the cluster just yet. Not all solutions will favor tiered storage, but you may find that you can change the access patterns for certain data, based on how it’s currently used.

By setting up the appropriate configuration in your cluster, you can have different indices travel from hot, to warm, to cold storage as they age. This means you don’t need to use the most expensive possible storage and node types for every document. This is a vital Elasticsearch cost optimization step for any large-scale cluster and typically yields significant savings.
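
One way to express this kind of configuration is an Index Lifecycle Management (ILM) policy. The sketch below builds a policy body in Python. The function name and the 7, 21 and 30 day thresholds are assumptions picked to match the month-long logs example, and it assumes your nodes are already assigned hot, warm and cold data tier roles so that indices can be relocated as they enter each phase.

```python
import json


def tiered_ilm_policy(warm_after="7d", cold_after="21d", delete_after="30d") -> dict:
    """Body for PUT _ilm/policy/<policy_name> moving indices hot -> warm -> cold."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    # Start a fresh index every day so old data ages out in whole indices.
                    "actions": {"rollover": {"max_age": "1d"}}
                },
                "warm": {
                    "min_age": warm_after,
                    # Older indices are no longer written to, so merge their segments.
                    "actions": {"forcemerge": {"max_num_segments": 1}},
                },
                "cold": {
                    "min_age": cold_after,
                    # Recover these indices last after a node restart.
                    "actions": {"set_priority": {"priority": 0}},
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(tiered_ilm_policy(), indent=2))
```

Note that when rollover is used, each `min_age` is measured from the time the index rolled over, not from when it was created.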

5. Right Type of Hardware

Once you’ve configured your cluster correctly, the next step is to ensure that you’ve picked the right storage and virtual machines to optimize your Elasticsearch costs. For example, your data nodes are going to need memory-heavy instances, while your master nodes will rely on the CPU.

Rightsizing your VMs (Virtual Machines)

The exact node type that you have available will depend on your cloud provider, but as a rough guide, here are the common node types you may see in a cluster, and some advice on the instance type you should consider:

Hardware for Master Nodes: Your master nodes should run on CPU optimized instances. Any voting-only nodes can run on far less powerful instance types, since they take part in master elections but are never elected master themselves. Master nodes can typically run on smaller instances than data nodes, since they coordinate the cluster rather than storing data or serving queries.

Hardware for Data Nodes: Data nodes do it all, so they will need solid all-round instances that can handle CPU, memory and I/O loads. You should focus on memory and I/O as the primary goal, but the CPU can become a bottleneck if you don’t include enough.

Hardware for Other Nodes: Ingest nodes process incoming documents before they are indexed, and Machine Learning nodes run Machine Learning jobs. Most of this activity happens in the CPU, so you should favor CPU optimized instances for these node types. If you use these features heavily, you should consider running nodes dedicated to those roles.

But how do you know the right instance type? It’s often an experimental process to find the perfect instance type. If you want to get an edge on the problem, Pulse offers insights into the hardware on which your cluster depends, and can even recommend more appropriate hardware, so that your cluster can run in the most efficient manner possible.

Choosing the correct storage

Possibly more important than your virtual machine type is your storage mechanism. Depending on the type of data you’ve got in your cluster, you may find that your I/O bottleneck renders your powerful servers redundant.

The most common mistake people make when choosing their storage type is to go with a remote storage option, like Elastic Block Store (EBS) on AWS. EBS is typically acceptable for smaller clusters but doesn’t scale well to larger ones, where you end up spending a lot of money on provisioned IOPS for very little performance benefit.

It goes back to hot, warm and cold

Teams often try to save money by choosing magnetic storage over solid-state across the board. While the cost per gigabyte is lower with magnetic storage, the disk performance requirements of your cluster mean that you need more disks to keep up. I/O throughput is commonly the bottleneck, and random seek time matters too. Overall, you spend more on storage and don’t get any performance benefit.

However, if you’re looking at cold storage in your cluster, you might wish to employ magnetic storage for those indices that receive only a few requests. This is something you will learn as your cluster grows and usage patterns present themselves.
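
The trade-off between magnetic and solid-state storage described above comes down to simple arithmetic: you need enough disks to satisfy both your capacity and your I/O targets, whichever is larger. The sketch below uses entirely hypothetical figures for the workload, per-disk capacity, per-disk IOPS and price. Substitute numbers from your own provider before drawing any conclusions.

```python
import math


def disks_needed(capacity_tb, required_iops, disk_tb, disk_iops):
    """Disks required to satisfy BOTH the capacity and the IOPS target."""
    for_capacity = math.ceil(capacity_tb / disk_tb)
    for_iops = math.ceil(required_iops / disk_iops)
    return max(for_capacity, for_iops)


# Hypothetical workload and per-disk figures, for illustration only.
CAPACITY_TB = 20
REQUIRED_IOPS = 30_000

hdd = disks_needed(CAPACITY_TB, REQUIRED_IOPS, disk_tb=4, disk_iops=200)
ssd = disks_needed(CAPACITY_TB, REQUIRED_IOPS, disk_tb=2, disk_iops=20_000)

HDD_PRICE, SSD_PRICE = 100, 250  # hypothetical price per disk

print(f"HDD: {hdd} disks, total {hdd * HDD_PRICE}")
print(f"SSD: {ssd} disks, total {ssd * SSD_PRICE}")
```

With these made-up numbers, the magnetic option needs 150 disks to reach the IOPS target even though 5 would cover the capacity, while the solid-state option needs only 10. For a cold tier with a very low IOPS requirement, the same calculation tips back in favor of magnetic storage.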

This all sounds like a lot of work!

On top of these decisions, you’ve also got other mechanisms for tuning your cluster. For example: taking control of your field mappings, using compression, choosing not to index certain fields and more.

Elasticsearch management is complex, but the benefits of a finely tuned, cost-optimized Elasticsearch cluster are difficult to deny. It can be a long and difficult journey to build the best cluster for your company, and Elasticsearch knowledge is both rare and expensive.

Rather than rolling a base Elasticsearch cluster, you can upgrade your cluster experience to bring sophisticated functionality out of the box. More than that, if you want to behave as the consumer of a cluster, rather than the architect, consider bringing in some experts to help and avoid the most common mistakes on the way to actionable insights.