solr, solr8

autoSoftCommit setting in solr8.8.0


My current indexing takes about 1:30 hr. That is too long to wait since I want NRT updates, so I have enabled autoCommit and autoSoftCommit as below:

<autoCommit>
     <maxTime>${solr.autoCommit.maxTime:600000}</maxTime> <!-- 10 minutes -->
     <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime> <!-- 5 minutes -->
</autoSoftCommit>

The problem is that every time a full import starts, it clears the old documents, which defeats the purpose of enabling autoSoftCommit. I don't know what I am missing here. My expectation is to keep the documents from the last index, add new documents, and replace duplicates.

If I disable autoSoftCommit, the documents are not deleted.

The indexing is started by a cronjob. The URL is http://localhost:8983/solr/mycore/dataimport?clean=true&commit=true&command=full-import

Appreciate any help. Thanks


Solution

  • When you commit, you end up clearing the index if you've issued a delete. Don't issue commits if you don't want deletes to be visible. You can't have it both ways: you can't do a full index that clears the index first and then expect the documents to appear progressively afterwards without the delete being committed as well. A full import is just that: it cleans out the index, then imports whatever documents currently exist, and then commits. If you commit earlier, the cleaning of the index will be visible.

    In general, when we talk about near real-time we mean submitting documents through the regular /update endpoints and having those changes be visible within a second or two. When you're using the DataImportHandler with a full-import, the whole import has to run before any changes become visible.
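    As a sketch of that NRT-style flow, a single document can be pushed through the regular update endpoint, where `commitWithin` asks Solr to make it searchable within the given number of milliseconds. The core name and document fields below are placeholders:

    ```shell
    # Hypothetical example: submit one JSON document to /update and ask
    # for it to become visible within ~1 second (commitWithin=1000 ms).
    curl 'http://localhost:8983/solr/mycore/update?commitWithin=1000' \
      -H 'Content-Type: application/json' \
      -d '[{"id": "doc1", "title": "example"}]'
    ```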

    If you still want to use the DataImportHandler (which has been removed from Solr core in 9 and is now a community project), you'll have to configure delta imports instead of using the full-import support. That way you only fetch the documents that have been added, removed or changed, and you don't have to issue the delete (the clean=true part of your URL), since any deletions are handled by your delta queries. This requires that your database has a way to track when a given row changed, so that you only import and process the rows that actually changed (at least if you want it to be efficient).
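    A delta-capable entity in the DIH configuration could look roughly like the sketch below. The table and column names (`items`, `last_modified`, `items_deleted`) are placeholder assumptions; `deltaQuery` finds changed primary keys since the last run, `deltaImportQuery` fetches each changed row, and `deletedPkQuery` reports rows to remove:

    ```xml
    <!-- Sketch only: adapt queries to your own schema. -->
    <entity name="item"
            query="SELECT id, title FROM items"
            deltaQuery="SELECT id FROM items
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title FROM items
                              WHERE id = '${dataimporter.delta.id}'"
            deletedPkQuery="SELECT id FROM items_deleted
                            WHERE deleted_at &gt; '${dataimporter.last_index_time}'"/>
    ```

    You would then trigger it with command=delta-import (and without clean=true) instead of command=full-import.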

    If you have no way of tracking this in your database layer, you're stuck doing it the way you're doing it now; in that case, disable the soft commit and let the changes become visible only after the import has finished.
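    Disabling the soft commit just means setting its maxTime to -1 (or removing the element), mirroring the config from the question:

    ```xml
    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime> <!-- -1 disables automatic soft commits -->
    </autoSoftCommit>
    ```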

    A hybrid approach is also possible: do delta updates and manual submissions to /update during the day, then run a full import at night to make sure that Solr and your database match. Whether this works for you depends on how quickly you need to handle any differences between Solr and your database (i.e. if you miss submitting a delete, is it critical if it doesn't get removed until late at night?).
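    The hybrid schedule could be sketched as a crontab like the one below (the exact timings are placeholder assumptions; the URLs follow the pattern from the question):

    ```shell
    # Hypothetical crontab: cheap delta imports every 5 minutes during the day,
    # one full import (with clean) at 02:00 to resync Solr with the database.
    */5 * * * * curl -s 'http://localhost:8983/solr/mycore/dataimport?command=delta-import&commit=true'
    0 2 * * *   curl -s 'http://localhost:8983/solr/mycore/dataimport?clean=true&commit=true&command=full-import'
    ```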