apache-spark elasticsearch pagination

Inconsistent results when sorting documents by _doc


I want to fetch Elasticsearch hits using the sort + search_after paging mechanism.

The Elasticsearch documentation states:

_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.

However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.

This obviously breaks paging, since paging relies on the sort value returned with the last hit being fed into search_after for the next query.
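To make the dependency concrete, here is a minimal sketch of a search_after loop. It is not tied to any particular client: `search` is a hypothetical callable that takes a request body and returns a response shaped like an Elasticsearch search response, and `build_page_request` and `iterate_hits` are illustrative names, not library functions.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_page_request(page_size: int,
                       search_after: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build the body for one page, sorted by _doc as in the question."""
    body: Dict[str, Any] = {"size": page_size, "sort": ["_doc"]}
    if search_after is not None:
        # The cursor is the "sort" array of the last hit of the previous page.
        body["search_after"] = search_after
    return body


def iterate_hits(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                 page_size: int) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page until an empty page comes back.

    `search` is an assumed stand-in for whatever sends the request.
    """
    cursor: Optional[List[Any]] = None
    while True:
        response = search(build_page_request(page_size, cursor))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Every following page depends on this value being stable.
        cursor = hits[-1]["sort"]
```

Because each request carries only the previous page's sort values, the loop is correct only if the same sort value refers to the same position in the result set on every request.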

No data is being written to the index while I am querying it, so this is not because of refreshes.

My questions are therefore:

  1. Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
  2. How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".

The data was written to the index in parallel using Spark. I thought the problem might be the parallel write combined with the "index order" sorting, but I could not reproduce this behavior with other indices that were also written with Spark.

Elasticsearch 7; the index contains 2 shards: one primary and one replica.

cheers.


Solution

  • This happened because the index consists of 2 shards: one primary and one replica. The documents were not indexed in the same order on each, so the order of the results depends on which shard served the request. This is fine when scrolling, because Elasticsearch keeps internal state for the results, but not with search_after paging, which is stateless.
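The effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Elasticsearch actually computes _doc values: `PRIMARY` and `REPLICA` are hypothetical copies holding the same documents in a different internal order, and the "sort value" is just the position within the copy that answered.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical data: the same three documents, stored in a different
# internal order on each copy.
PRIMARY = ["a", "b", "c"]
REPLICA = ["b", "a", "c"]

Page = List[Tuple[str, object]]


def page_by_doc(copy: Sequence[str], size: int, after: Optional[int]) -> Page:
    """Page by internal position, which differs from copy to copy."""
    start = 0 if after is None else after + 1
    return [(doc_id, pos) for pos, doc_id in enumerate(copy)][start:start + size]


def page_by_key(copy: Sequence[str], size: int, after: Optional[str]) -> Page:
    """Page by a unique per-document value, which is the same on every copy."""
    remaining = [d for d in sorted(copy) if after is None or d > after]
    return [(d, d) for d in remaining[:size]]


def walk(pager: Callable[..., Page], copies: Sequence[Sequence[str]],
         size: int = 1) -> List[str]:
    """Stateless paging where each request is answered by the next copy."""
    seen: List[str] = []
    cursor = None
    for copy in copies:
        page = pager(copy, size, cursor)
        if not page:
            break
        seen.extend(doc_id for doc_id, _ in page)
        cursor = page[-1][1]  # sort value of the last hit
    return seen


by_doc = walk(page_by_doc, [PRIMARY, REPLICA, PRIMARY])
by_key = walk(page_by_key, [PRIMARY, REPLICA, PRIMARY])
```

In this model, `by_doc` returns one document twice and skips another, while `by_key` visits each document exactly once. That suggests two possible mitigations, both to be checked against the documentation for your version: sort on a field whose value is unique per document, or route every page to the same shard copy (for example with the search `preference` parameter).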