I'm indexing about 40M documents into Elasticsearch. It's usually a one-off data load, after which we run queries on top. There are no further updates to the index itself. However, the default Elasticsearch settings aren't giving me the throughput I expected.
So, in the long list of things to tune and verify, I was wondering whether ordering by a business key would help improve search throughput. All our analysis queries use this key, it is already indexed as a keyword, and we filter on it as shown below:
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "type": "cross_fields",
          "query": "store related query",
          "minimum_should_match": "30%",
          "fields": ["field1^5", "field2^5", "field3^3", "field4^3", "firstLine", "field5", "field6", "field7"]
        }
      },
      "filter": {
        "term": {
          "businessKey": "storename"
        }
      }
    }
  }
}
This query is run in bulk about 20M times over a few hours. Currently I cannot get past 21k/min, but that could be due to various factors. Any tips to improve performance for this sort of workflow (load once, search a lot) would be appreciated.
However, I'm particularly interested in whether I could order the data by business key when indexing, so that the data for a given businessKey lives within a single Lucene segment and the lookup is therefore quicker. Is that line of thought correct? Is this something ES already does, given that it's a keyword term?
This is a very good performance optimization use case and, as you already mentioned, there is a list of optimizations you will need to work through.
I can see you are already building the query correctly: you filter the records on businessKey in filter context and then search only the remaining docs, so you are already making use of Elasticsearch's filter cache.
With such a large number of documents (~40M), it doesn't make sense to try to put all of them in a single segment. By default the merge policy caps merged segments at 5 GB, and segments beyond that size are no longer merged automatically, so it is almost impossible to end up with just one segment for your data.
I think there are a couple of things you can do:
GET your-index/_stats/request_cache?human
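To make sense of that response, it can help to turn the counters into a hit ratio. Below is a minimal Python sketch, assuming the response carries `hit_count` and `miss_count` under `_all.total.request_cache`. The helper name `request_cache_hit_ratio` is hypothetical and the sample numbers are made up purely for illustration. Keep in mind that, by default, the request cache only stores results for requests with `size: 0`, so a low ratio for searches that return hits is not surprising.

```python
def request_cache_hit_ratio(stats: dict) -> float:
    """Return hits / (hits + misses) from an index stats response."""
    cache = stats["_all"]["total"]["request_cache"]
    hits = cache["hit_count"]
    misses = cache["miss_count"]
    lookups = hits + misses
    # Avoid dividing by zero when the cache has not been consulted yet.
    return hits / lookups if lookups else 0.0


# Made-up sample shaped like the response of
# GET your-index/_stats/request_cache (values are illustrative only).
sample = {
    "_all": {
        "total": {
            "request_cache": {
                "memory_size_in_bytes": 1048576,
                "evictions": 0,
                "hit_count": 300,
                "miss_count": 100,
            }
        }
    }
}

print(request_cache_hit_ratio(sample))
```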
Your main issue is throughput, and you want to go beyond the current limit of 21k/min, so this will also require a fair amount of index and cluster configuration tuning. I have written some short tips to improve search performance; please refer to them and let me know how it goes.
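On the client side, one option that may be worth trying, if you are not doing so already, is to batch many of these searches into a single `_msearch` request rather than sending them one by one. The sketch below only builds the newline-delimited request body and sends nothing; `build_query` and `build_msearch_body` are hypothetical helper names, `your-index` is a placeholder, and a sensible batch size is something you would have to measure on your own cluster.

```python
import json
from typing import Iterable, Tuple


def build_query(search_text: str, business_key: str) -> dict:
    """Build the same bool query as in the question for one search."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "type": "cross_fields",
                        "query": search_text,
                        "minimum_should_match": "30%",
                        "fields": ["field1^5", "field2^5", "field3^3", "field4^3",
                                   "firstLine", "field5", "field6", "field7"],
                    }
                },
                # Filter context: no scoring, and the clause can be cached.
                "filter": {"term": {"businessKey": business_key}},
            }
        }
    }


def build_msearch_body(index: str, searches: Iterable[Tuple[str, str]]) -> str:
    """Return an NDJSON body for _msearch: one header line plus one query line per search."""
    lines = []
    for search_text, business_key in searches:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(build_query(search_text, business_key)))
    # The multi search API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(
    "your-index",
    [("store related query", "storename"), ("another query", "otherstore")],
)
print(body)
```

The resulting string would be sent as the body of a `POST _msearch` request with the `application/x-ndjson` content type.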