elasticsearch elasticsearch-performance

Would ordering of documents when indexing improve Elasticsearch search performance?


I'm indexing about 40M documents into Elasticsearch. It's usually a one-off data load, after which we only run queries; there are no further updates to the index itself. However, the default Elasticsearch settings aren't giving me the throughput I expected.

So, among the long list of things to tune and verify, I was wondering whether ordering the documents by a business key would help improve search throughput. All our analysis queries use this key, it is already indexed as a keyword, and we filter on it like below:

{
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "type": "cross_fields",
                    "query": "store related query",
                    "minimum_should_match": "30%",
                    "fields": [ "field1^5", "field2^5", "field3^3", "field4^3", "firstLine", "field5", "field6", "field7" ]
                }
            },
            "filter": {
                "term": {
                    "businessKey": "storename"
                }
            }
        }
    }
}

This query is run in bulk, about 20M times over a few hours. Currently I cannot get past 21k queries/min, though that could be due to various factors. Any tips to improve performance for this sort of workflow (load once, search a lot) would be appreciated.

However, I'm particularly interested to know whether I could order the data by business key while indexing, so that all documents for a given businessKey live within a single Lucene segment and the lookup would be quicker. Is that line of thought correct? Is this something ES already does, given that it's a keyword term?


Solution

  • It's a very good performance-optimization use case, and as you already mentioned, there's a list of optimizations you'll need to work through.

    I can see you are already building the query correctly: filtering records on businessKey and then searching the remaining docs means you are already making use of Elasticsearch's filter cache.

    As you have a huge number of documents (~40M), it doesn't make sense to put all of them in a single segment. The default maximum segment size is 5 GB, and the merge process won't merge segments beyond that size, so it's almost impossible to end up with just one segment for your data.
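    On the ordering question itself: Elasticsearch does support sorting documents within each segment at index time via the index.sort settings, which must be declared at index creation and work on keyword fields. A hedged sketch (index name and mapping are taken from your example; verify the syntax against your ES version, and note that index sorting slows down indexing in exchange for faster sorted/early-terminated queries):

    PUT my-index
    {
        "settings": {
            "index": {
                "sort.field": "businessKey",
                "sort.order": "asc"
            }
        },
        "mappings": {
            "properties": {
                "businessKey": { "type": "keyword" }
            }
        }
    }

    This sorts documents by businessKey inside each segment, but it does not guarantee that one businessKey lives in exactly one segment.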

    A couple of things you can do:

    1. Disable the refresh interval once you are done ingesting data and are preparing the index for search.
    2. As you are using filters, the request cache should be used; monitor cache usage while querying to see how often results are served from the cache:

       GET your-index/_stats/request_cache?human

    3. Read throughput increases with more replicas; if you have spare nodes in your Elasticsearch cluster, make sure they hold replicas of your index.
    4. Monitor the search queues on each node and make sure they aren't getting exhausted; otherwise you will not be able to increase throughput. Refer to the ES threadpool docs for more info.
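    For points 1 and 3, both settings can be changed dynamically; a sketch (the index name and the final values are placeholders for your setup):

    During the bulk load:

    PUT your-index/_settings
    {
        "index": {
            "refresh_interval": "-1",
            "number_of_replicas": 0
        }
    }

    After the load, before searching:

    PUT your-index/_settings
    {
        "index": {
            "refresh_interval": "1s",
            "number_of_replicas": 1
        }
    }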

    Your main issue is throughput: you want to go beyond the current limit of 21k/min, and that requires index- and cluster-level configuration tuning as well. I have written short tips to improve search performance; please try them and let me know how it goes.
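    One more thing: if the 20M lookups are issued one HTTP request at a time, batching them with the _msearch API is often the biggest single win for this kind of workload. A minimal Python sketch that only builds the NDJSON payload (index name, store names, and the helper name are placeholders; sending it requires an HTTP client and a running cluster):

```python
import json

def build_msearch_payload(index, store_names, minimum_should_match="30%"):
    """Build the NDJSON body for one _msearch request.

    Emits one header line plus one body line per store name, reusing the
    bool/multi_match + businessKey filter query from the question.
    """
    lines = []
    for store in store_names:
        # header line: which index this sub-search targets
        lines.append(json.dumps({"index": index}))
        # body line: the actual query
        lines.append(json.dumps({
            "query": {
                "bool": {
                    "must": {
                        "multi_match": {
                            "type": "cross_fields",
                            "query": "store related query",
                            "minimum_should_match": minimum_should_match,
                            "fields": ["field1^5", "field2^5", "field3^3",
                                       "field4^3", "firstLine", "field5",
                                       "field6", "field7"],
                        }
                    },
                    "filter": {"term": {"businessKey": store}},
                }
            }
        }))
    # _msearch bodies must end with a trailing newline
    return "\n".join(lines) + "\n"

payload = build_msearch_payload("your-index", ["store-a", "store-b"])
# POST this payload to /your-index/_msearch
# with the header Content-Type: application/x-ndjson
```

    Batching, say, a few hundred store lookups per _msearch call amortizes the HTTP round-trip and lets the cluster schedule the sub-searches across shards.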