I am trying to optimize indexing speed in Elasticsearch. We reindex our indices every hour, so the faster we can reindex our data, the smaller the lag we can achieve.
I came across this article, which describes reaching a reindexing throughput of 100k docs/s: https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8#.4w3kl9ebf, and this StackOverflow question, which achieves even higher throughput: ElasticSearch - high indexing throughput.
My question is whether it is possible to achieve a sustained indexing throughput of 1 million documents per second, and if so, how?
It will depend on a few factors, but why should it be impossible? A few key factors that speed up the indexing process (see the sketch after this list):

- Small, flat documents index much faster than large, deeply nested ones.
- Bulk requests instead of indexing one document at a time.
- Several parallel clients or workers sending bulk requests.
- Index settings tuned for ingest, e.g. refresh disabled and replicas set to 0 during the load.
- Fast hardware: SSDs, plenty of CPU cores, and enough heap.
- Horizontal scaling: more nodes and more shards.
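For example, disabling refresh and replication for the duration of a bulk load is a common first step. A minimal sketch with the official Python client (the index name and endpoint are placeholders, and the exact keyword for the settings payload may vary by client version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Before the bulk load: stop refreshes and drop replicas, so each
# document is written once and no refresh cycles compete for I/O.
es.indices.put_settings(
    index="my-index",  # hypothetical index name
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

# ... run the bulk load here ...

# After the load: restore refresh and replication.
es.indices.put_settings(
    index="my-index",
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
```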
As an example, with small documents and a single eight-core machine, I was able to index at about 70k-120k docs/s. Throw in a few more cores or machines and you could approach 1M docs/s.
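The kind of setup behind such numbers is bulk indexing fanned out over parallel workers. A rough sketch with the Python client's parallel_bulk helper; all names and parameter values here are illustrative assumptions, not the exact setup used for the measurement above:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def generate_actions():
    # Tiny documents, similar in spirit to the small docs mentioned
    # above; real data would come from your source system.
    for i in range(10_000_000):
        yield {"_index": "my-index", "_source": {"id": i, "value": "x"}}

# parallel_bulk sends bulk requests from a thread pool; more threads
# and larger chunks usually raise throughput until the cluster
# becomes the bottleneck.
for ok, info in helpers.parallel_bulk(
    es, generate_actions(), thread_count=8, chunk_size=5000
):
    if not ok:
        print(info)  # report failed actions instead of silently dropping them
```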
Update: another test run with Elasticsearch 6.1.0, on a single 32-core Xeon E5 with a 64 GB JVM heap. Here, esbulk could index about 330,000 docs/s, using 10 million small documents of 20-40 bytes each.
Disclaimer: I wrote esbulk. The README contains a few measurements; the maximum at the moment is about 300k docs/s.
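For completeness, an esbulk invocation looks roughly like the following; the flag names are from memory, so check the README for the current set, and the index name and input file are placeholders:

```sh
# Index newline-delimited JSON with 8 parallel workers and a bulk
# size of 5000 docs per request (illustrative values only).
esbulk -index my-index -w 8 -size 5000 data.ldj
```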