When scanning an Elasticsearch index, it is not possible to apply any sorting, according to the documentation. But is the order of the results defined at all during a scan? If so, is it predictable?
Background info:
I need to run operations on 5M documents regularly, with each batch of 1,000 docs taking about 1 minute to process. As I cannot be sure that the process will finish each time it is run, I would love to make it pick up its work where it was interrupted last time. For example, if the scroll result were sorted by ID (I know, it is not), I would keep track of the last processed ID in my code, and on the next run skip any document with ID <= lastProcessedId, to make sure that every document gets processed regularly.
Btw: by "processing the document" I do not mean writing additional info back to the index, but rather updating some other stuff in my database. Writing a timestamp to the indexed document would not help in my case, since one of the reasons for the process being interrupted could be that the index is replaced with a fresh index (rebuilt from scratch). Writing a processed timestamp to the database is also not a desired option for me, because iteration performance is the reason why I am using the index to scroll in the first place.
No, the sort order is not predictable. I was going to suggest using timestamps, but then I read the rest of your question :)
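Really, the only way to make a scanned search "resumable" is to divide your docs into tranches on some field, e.g. timestamp or ID, and to use a range query to scroll through just one tranche at a time. Within a tranche the order is still undefined, so if a run is interrupted you restart from the beginning of the unfinished tranche.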
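
A rough sketch of the tranche idea in Python, in case it helps. It assumes your documents have a numeric field called id (adjust the name to your mapping), and that fetch is a function you write yourself which runs the scan/scroll for a given query body and yields the documents. fetch, handle and the checkpoint file are all made up for illustration, not part of any Elasticsearch client. The only Elasticsearch-specific part is the range query body. Progress is stored as a single number in a local file, so nothing is written per document.

```python
import json
import os


def tranche_bounds(min_id, max_id, size):
    """Yield half-open [lo, hi) ID ranges covering min_id..max_id inclusive."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size, max_id + 1)
        yield lo, hi
        lo = hi


def range_query(field, lo, hi):
    """Build a range query body selecting one tranche: lo <= field < hi."""
    return {"query": {"range": {field: {"gte": lo, "lt": hi}}}}


def load_checkpoint(path):
    """Return the first ID that has not been fully processed, or None."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["next_lo"]


def save_checkpoint(path, next_lo):
    """Record that every ID below next_lo has been processed."""
    with open(path, "w") as f:
        json.dump({"next_lo": next_lo}, f)


def run(min_id, max_id, size, fetch, handle, checkpoint_path):
    """Process tranches in ascending order, skipping finished ones.

    fetch(body) is a hypothetical callable: it should scan/scroll the index
    with the given query body and yield documents (in any order).
    handle(doc) is your own per-document processing.
    """
    resume_from = load_checkpoint(checkpoint_path)
    for lo, hi in tranche_bounds(min_id, max_id, size):
        if resume_from is not None:
            if hi <= resume_from:
                continue  # tranche was completed in an earlier run
            lo = max(lo, resume_from)
        for doc in fetch(range_query("id", lo, hi)):
            handle(doc)
        # Only advance the checkpoint once the whole tranche is done.
        save_checkpoint(checkpoint_path, hi)
```

If a run dies in the middle of a tranche, the next run repeats that whole tranche, so handle should tolerate seeing the same document twice. Smaller tranches mean less repeated work but more scroll requests. Once a full pass completes, delete the checkpoint file so the next pass starts from min_id again.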