solrdih

SOLR DataImportHandler (DIH) Full Indexing - sometimes our index shows near-zero documents during import


We're running SOLR 7.2.1.
We periodically update our index (full re-index) via DataImportHandler (with clean=true).

Most of the time, the normal number of documents in our index (typically around 250,000) remains visible while DIH is running (because it does not commit until the end of the import).

Intermittently, however, we have an issue where the index will suddenly show only a small subset of the documents (maybe 20,000 of them, for example).

I have not been able to track down the source of this, but I do have a suspicion as to the cause: if someone modifies a product in our website admin area, that will trigger an update to SOLR for that document (with a commit). Is it possible that this commit, issued by a separate process, will also cause the partially completed DIH data to be committed? If so, that would explain why we sometimes end up with a smaller subset of documents in the index. When the DIH completes, the document count returns to normal.

So, do overlapping commits affect each other? In other words, is a commit "global", or does it only affect the data being changed in the current process?

I'd appreciate any clarification on this.

Thanks!

Bill


Solution

  • Transactions are not isolated in Solr - a commit or a rollback affects all documents queued for the index, not just those submitted by the thread or client that issued it. So your suspicion is correct: the commit issued by the admin-area update while DIH is running also makes DIH's partial work visible, which includes the initial delete of all documents and whatever has been re-imported up to that point.
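To make the sequence of events concrete, here is a toy in-memory model of it. This is not Solr code: `ToyIndex` and `simulate` are hypothetical names, and the document counts are scaled down (250 instead of 250,000) purely for illustration. It only assumes what is described above, namely that pending changes are shared and a commit from any client publishes all of them.

```python
class ToyIndex:
    """Toy model of one index: a single pending state shared by all clients."""

    def __init__(self, docs):
        self.visible = dict(docs)  # what searches see (state at last commit)
        self.pending = dict(docs)  # uncommitted state, shared by every client

    def delete_all(self):
        self.pending.clear()

    def add(self, doc_id, doc):
        self.pending[doc_id] = doc

    def commit(self):
        # A commit publishes everything pending, no matter who queued it.
        self.visible = dict(self.pending)

    def num_found(self):
        return len(self.visible)


def simulate():
    index = ToyIndex({i: {"id": i} for i in range(250)})
    counts = []

    index.delete_all()                  # full import starts by clearing the index
    for i in range(20):                 # the import is only partway through
        index.add(i, {"id": i})
    counts.append(index.num_found())    # nothing committed yet, old docs still visible

    index.add(7, {"id": 7, "name": "edited in admin"})
    index.commit()                      # admin-area update commits everything pending
    counts.append(index.num_found())    # only the partial import is visible now

    for i in range(20, 250):            # the import finishes and commits
        index.add(i, {"id": i})
    index.commit()
    counts.append(index.num_found())
    return counts


if __name__ == "__main__":
    print(simulate())
```

The three counts correspond to "import running, nothing committed", "right after the admin commit", and "import finished", which matches the dip and recovery described in the question.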

    The usual way around this is to drop DIH and index all documents yourself, which gives you complete control over the indexing process. I'd also try to avoid removing all documents from the index at the start. If possible, track deleted documents (deleting them from the index as they're removed in the web interface) and run an extra bulk delete later if you suspect some were missed.

    Another option is to perform the DIH operation on a separate index, then use a collection alias to swap where the searched collection points after indexing has completed. This allows you to do the complete index to a separate collection, and when it has finished, swap the current one for the new one and start serving queries from the one you just built.
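As a rough sketch of the swap step, the snippet below only builds the URL for the Collections API `CREATEALIAS` action, which repoints an alias at the given collection. The base URL, the alias `products`, and the collection name `products_b` are made-up values for illustration. Note that aliases assume SolrCloud; on a standalone install the closest equivalent I know of is the CoreAdmin `SWAP` action.

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collection):
    """Build the Collections API URL that points `alias` at `collection`.

    Calling CREATEALIAS with an alias name that already exists repoints it,
    so the same request works for the first setup and for every later swap.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,              # the name your application queries
        "collections": collection,  # the freshly built collection
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical names: run the full import into products_b, then swap.
    print(create_alias_url("http://localhost:8983/solr", "products", "products_b"))
```

The application keeps querying the alias the whole time, so a partially built collection is never searched.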

    Be aware that with this approach, if the underlying data changes while DIH is rebuilding and you rely on the direct Solr update to go through, you'll end up with the wrong data in the index after the swap, because the direct update is performed against the collection currently being served, not the one being built.

    My choice would be to try to keep the Solr collection in sync with your database, without having to use DIH - and instead rely on direct updates going through. You can then use commitWithin to allow multiple threads to add documents, without having to issue explicit commits in either thread.
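A minimal sketch of what a direct update with commitWithin could look like, using only the Python standard library. The host, the collection name `products`, the document fields, and the 10-second window are assumptions for illustration. The function only builds the request; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, collection, docs, commit_within_ms=10000):
    """Build a POST to /update that adds `docs` without an explicit commit.

    commitWithin asks Solr to make the documents visible within the given
    number of milliseconds, so no client has to issue its own commit.
    """
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url.rstrip('/')}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")  # a JSON array of documents
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical document; field names depend on your schema.
    req = build_update_request(
        "http://localhost:8983/solr",
        "products",
        [{"id": "42", "name": "Blue widget"}],
    )
    print(req.get_method(), req.full_url)
```

Both the admin-area hook and any bulk indexer can send requests like this, and since neither issues a commit of its own, neither can publish the other's half-finished work at a moment it didn't choose.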