Hello fellow technicians,
Let's assume we have a (PHP) website with millions of visitors a month, and we are running a Solr index for the website with 4 million documents. Solr runs on 4 separate servers: one master and three replicated slaves.
Thousands of documents can be inserted into Solr every 5 minutes. Besides that, users can update their accounts, which should also trigger a Solr update.
I am looking for a strategy to rebuild the index quickly and safely without missing any documents, and for a safe delta/update strategy. I have thought of an approach and want to share it with the experts here to hear their opinion on it: should I go for this approach, or would you advise something (totally) different?
Solr DataImport
For all operations I want to use one data-import handler. I want to combine full and delta import in one config file, as described in DataImportHandlerDeltaQueryViaFullImport. We are using a MySQL database as the data source.
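As a rough sketch of that idea: the handler is always called with `full-import`, and the `clean` request parameter decides whether everything or only recently changed rows are imported. The host, the `/dataimport` handler path, the `item` table and the `last_modified` column below are assumptions for illustration, not part of the actual setup.

```python
from urllib.parse import urlencode

# Assumed base URL and handler path; adjust to the actual solrconfig.xml.
SOLR = "http://localhost:8983/solr"

# The entity query in data-config.xml would look roughly like this
# (the table and the last_modified column are hypothetical):
#   SELECT * FROM item
#   WHERE '${dataimporter.request.clean}' != 'false'
#      OR last_modified > '${dataimporter.last_index_time}'

def import_url(core, delta):
    """Build a full-import URL; clean=false turns it into a delta run."""
    params = {
        "command": "full-import",
        "clean": "false" if delta else "true",
        "commit": "true",
    }
    return "%s/%s/dataimport?%s" % (SOLR, core, urlencode(params))

print(import_url("reindex", delta=False))  # complete rebuild
print(import_url("live", delta=True))      # only rows changed since last run
```

With `clean=false` the existing documents stay in the index and only the rows matched by the `last_modified` condition are re-imported.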
Rebuilding index
For rebuilding the index I have the following in mind: we create a new core called 'reindex' next to the 'live' core. With the DataImportHandler we completely rebuild the whole document set (4 million documents), which takes about 1-2 hours in total. Meanwhile, the live index still receives updates, inserts and deletions every minute.
After the rebuild, the new index is already 1-2 hours behind. To close that gap we run one 'delta' import against the new core to commit all changes from the last 1-2 hours. When this is done, we do a core swap. The normal 'delta' import, which runs every minute, will then pick up this new core.
Committing updates to live core
To keep our live core up to date we run the delta import every minute. Because of the core swap, the reindex core (which is now the live core) will be tracked and kept up to date. I am guessing it should not really be a problem if this index is a few minutes behind, because dataimport.properties will be swapped as well? The delta import has to catch up on those minutes of delay, but that should be possible.
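The rebuild sequence could be scripted roughly as follows. This is only a sketch under assumptions: the master URL, the `/dataimport` handler path and the check for the word "busy" in the status response are illustrative, and `fetch` is a hypothetical hook so the sequence can be tried without a running Solr.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR = "http://localhost:8983/solr"  # assumed master URL

def http_get(url):
    """Default fetcher: return the response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

def dih(core, command, **extra):
    """URL for a DataImportHandler command on one core."""
    params = dict(command=command, **extra)
    return "%s/%s/dataimport?%s" % (SOLR, core, urlencode(params))

def wait_until_idle(core, fetch, poll_seconds=10):
    """Poll the status command until the handler no longer reports busy."""
    # Assumes the status response contains the word "busy" while running.
    while "busy" in fetch(dih(core, "status")):
        time.sleep(poll_seconds)

def rebuild_and_swap(fetch=http_get, poll_seconds=10):
    # 1. Full rebuild into the offline core (1-2 hours in this setup).
    fetch(dih("reindex", "full-import", clean="true"))
    wait_until_idle("reindex", fetch, poll_seconds)
    # 2. One catch-up delta for changes made during the rebuild.
    fetch(dih("reindex", "delta-import"))
    wait_until_idle("reindex", fetch, poll_seconds)
    # 3. Swap: 'live' now points at the freshly built index.
    swap = urlencode({"action": "SWAP", "core": "live", "other": "reindex"})
    fetch("%s/admin/cores?%s" % (SOLR, swap))
```

Calling `rebuild_and_swap()` would run the three steps in order against the assumed master; the old index remains available under the name 'reindex' after the swap.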
I hope you understand my situation and my strategy and can advise whether I'm doing it the right way in your eyes. I would also like to know if there are any bottlenecks I haven't thought of. We are running Solr version 1.4.
One question I do have: what about replication? If the master server swaps the core, how do the slaves handle this?
And are there any risks of losing documents when swapping, etc.?
Thanks in advance!
Good (and hard) question!
A full-import is a very heavy operation; in general it's better to run delta queries that only update your index with the latest changes in your RDBMS. I see why you swap the master core when you need to do a full-import: you keep the live core up to date using delta-import while the full-import, which takes two hours, runs on the new core. Sounds good, as long as the full-import is not used that frequently.
Regarding replication, I would make sure that there isn't any replication in progress before swapping the master core. For more details about how replication works you can have a look at the Solr wiki if you haven't done so yet.
Furthermore, I would make sure that there isn't any delta-import running on the live core before swapping the master core.
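Both checks could be automated before issuing the swap. The sketch below is an illustration under assumptions: the host names are made up, and it assumes the DataImportHandler status response contains "busy" while an import runs and that a slave's replication `details` response contains `<str name="isReplicating">true</str>` during a fetch. Verify both against your own Solr responses.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

MASTER = "http://localhost:8983/solr"   # assumed master URL
SLAVES = [                              # hypothetical slave hosts
    "http://slave1:8983/solr",
    "http://slave2:8983/solr",
    "http://slave3:8983/solr",
]

def http_get(url):
    """Default fetcher: return the response body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

def delta_running(fetch):
    """True if the live core's import handler reports a running import."""
    url = "%s/live/dataimport?%s" % (MASTER, urlencode({"command": "status"}))
    return "busy" in fetch(url)

def slave_replicating(slave, fetch):
    """True if the slave reports an index fetch in progress."""
    url = "%s/live/replication?%s" % (slave, urlencode({"command": "details"}))
    # Assumption: an active fetch is flagged by this element in the response.
    return '<str name="isReplicating">true</str>' in fetch(url)

def safe_to_swap(fetch=http_get):
    """Swap only when no delta import and no replication is in progress."""
    if delta_running(fetch):
        return False
    return not any(slave_replicating(s, fetch) for s in SLAVES)
```

The check and the swap are not atomic, so it probably helps to pause the delta-import cron job first and only then call `safe_to_swap()`, which narrows the window in which a new import or fetch could start.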