Quick way to improve Elasticsearch performance on a single machine

In Elasticsearch by Aurimas Mikalauskas

It’s hard to find a server with fewer than 4 cores and 2 disks these days. Multi-core, multi-disk environments have become a commodity, yet not all software is built (or, in some cases, configured out of the box) to make the best use of that. Moreover, measuring and instrumenting such systems properly has become increasingly complex, which is why this has been one of the most interesting topics I speak about at tech conferences.

Elasticsearch is not an exception. Yes, it is multi-threaded and it makes pretty good use of the available CPUs and disks, especially in high-concurrency environments. But. If you want your queries to return faster by using as much CPU or disk capacity as they possibly can, there’s something you can do about it.

Elasticsearch Performance

First, I want to make it clear what I mean by performance here, because performance means different things to different people and in different contexts.

In this specific case, I want to focus on response time. Specifically, on the response time of a single search request. If, for the duration of the request, the machine uses more resources – CPU or disk – that is fine, as long as the search request returns faster.

Because we are doing this on a single machine, this kind of performance optimization is often called Scale-Up, and that’s what we’re going to call it here too.

Workload: CPU or Disk bound?

It is important to talk about disks and CPUs in this context because the type of workload you have, as well as the number of CPU cores and disks in the system, will determine how much you can Scale Up.

If your working set (the amount of data used to answer most search queries) fits in memory, then it is the number of CPU cores that will eventually limit the response time of your queries.

If, on the other hand, queries read from disk a lot, then you will end up with a disk-bound workload.

I won’t go into measuring the workload in this article (leave a comment if that’s of interest to you), but essentially, if any given index is bigger than the available amount of memory on the system, then performance will be limited by the number of disks you have in the server.
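As a rough sanity check, here’s a minimal sketch in Python using the requests library (the endpoint and the index name my_index are assumptions, and the RAM lookup is Linux-only) that compares an index’s on-disk size with the machine’s physical memory:

    import os
    import requests

    ES = "http://localhost:9200"   # assumed Elasticsearch endpoint
    INDEX = "my_index"             # hypothetical index name

    # On-disk size of the index (primaries + replicas), in bytes
    stats = requests.get(f"{ES}/{INDEX}/_stats/store").json()
    index_bytes = stats["indices"][INDEX]["total"]["store"]["size_in_bytes"]

    # Physical RAM on a Linux box
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

    print("index: %.1f GiB, RAM: %.1f GiB" % (index_bytes / 2**30, ram_bytes / 2**30))
    print("likely CPU-bound" if index_bytes < ram_bytes else "likely disk-bound")

This is only a heuristic, of course – the working set is usually smaller than the whole index – but it tells you which side of the line you are probably on.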

The secret to better performance

So here’s how Elasticsearch executes queries at a high level.

First, it checks how many shards you have in your cluster (yes, on a single machine, it’s going to be a one-node cluster), sends the very same query to all of the shards in parallel, then retrieves the results, aggregates them, and sends the response back to the client.

The key phrase here is sends the query to all of the shards in parallel, because that’s the main area where we can get some performance improvements. As you may have guessed already, the number of shards for a given index determines how many parallel requests the server will run internally while searching.

By default, 5 shards are created per index, which means that on a 24-core system you will only use 5 cores for any single search query in a CPU-bound workload. Likewise, if you have a 50-disk RAID array, or a fast SSD that supports parallel I/O requests (all flash disks do), the default index configuration will not make the best use of that powerful hardware.
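You can see how many primary shards an existing index has by reading its settings. A minimal sketch in Python with the requests library (the endpoint and index name are placeholders):

    import requests

    ES = "http://localhost:9200"   # assumed Elasticsearch endpoint
    INDEX = "my_index"             # hypothetical index name

    # The index settings include the (fixed) number of primary shards
    settings = requests.get(f"{ES}/{INDEX}/_settings").json()
    shards = settings[INDEX]["settings"]["index"]["number_of_shards"]
    print(f"{INDEX} has {shards} primary shards")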

In other words, all we want to change is the number of shards. But…

Houston, we have a problem

All would be nice, but the problem is that in the current version of Elasticsearch you can’t change the number of shards for existing indexes. You can only set it at index creation time. Therefore, you will have to reindex your data with the new number of shards. Reindexing is nicely described in The Definitive Guide.

So you want to create an index with the appropriate number of shards (say, 24):
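A minimal sketch of that request in Python with the requests library (the endpoint, the index name my_new_index, and the replica count are assumptions):

    import requests

    ES = "http://localhost:9200"   # assumed Elasticsearch endpoint

    # Create the new index with 24 primary shards; setting replicas to 0
    # is just an assumption to keep the initial load fast on one machine.
    body = {
        "settings": {
            "number_of_shards": 24,
            "number_of_replicas": 0
        }
    }
    print(requests.put(f"{ES}/my_new_index", json=body).json())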

And then load all your data into this new index.
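On newer Elasticsearch releases, the built-in _reindex API can do that copy in a single request (older versions need the scroll-and-bulk approach from The Definitive Guide). A sketch, assuming the old index is my_index and the new one is my_new_index:

    import requests

    ES = "http://localhost:9200"   # assumed Elasticsearch endpoint

    # Copy every document from the old index into the new 24-shard index.
    # wait_for_completion=false returns a task id to poll instead of
    # keeping the HTTP request open for the whole copy.
    body = {
        "source": {"index": "my_index"},
        "dest": {"index": "my_new_index"}
    }
    print(requests.post(f"{ES}/_reindex?wait_for_completion=false", json=body).json())

Once the copy finishes, point your application (or an index alias) at the new index and drop the old one.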

How many shards?

While I have not done any benchmarks to determine the perfect ratio between the number of shards and the number of CPU cores or disks, performance-wise it’s better to have more shards than fewer, because the OS will schedule the overlapping requests properly.

But as a rule of thumb, I would align them either to the number of CPU cores or to the number of disks, depending on the workload.
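For a CPU-bound workload, that rule of thumb can be as simple as this sketch (counting disks is left out, since it depends on your RAID layout):

    import os

    # Align the number of primary shards with the available CPU cores;
    # for a disk-bound workload you'd use the number of disks instead.
    number_of_shards = os.cpu_count() or 1
    print(f"number_of_shards: {number_of_shards}")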

Oh, and if you would like to have some benchmarks done, do let me know in the comments.

Will this help all the time?

No. The degree to which it will improve performance depends mostly on how much data your search requests match. If they match hundreds of thousands of documents, then the query will spend a lot more time in the aggregation phase, which is single-threaded for the most part, so this optimization will be less helpful.

If, on the other hand, your search requests are highly selective, then this is your ticket to 2, 3 or even up to 5 times faster search queries if you have a lot of cores and/or disks.
