
Pagination not working while querying multiple Solr collections


I have two collections, mdsearch_veevavault and mdsearch_hema, which I query together:

http://rldata:8983/solr/mdsearch_veevavault_shard1_replica1/select?q=*%3A*&fl=id,desc1&wt=json&indent=true&collection=mdsearch_veevavault,mdsearch_hema&sort=titlesort%20desc,%20id%20asc

When I query without specifying start and rows, it returns:

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "q":"*:*",
      "indent":"true",
      "fl":"id,desc1",
      "collection":"mdsearch_veevavault,mdsearch_hema",
      "sort":"titlesort desc, id asc",
      "wt":"json"}},
  "response":{"numFound":6963,"start":0,"docs":[
      {

}
This gives me 6963 results, which is correct.

Now I add the start and rows parameters, with start = 300 and rows = 25:

http://rldata:8983/solr/mdsearch_veevavault_shard1_replica1/select?q=*%3A*&fl=id,desc1&wt=json&indent=true&collection=mdsearch_veevavault,mdsearch_hema&sort=titlesort%20desc,%20id%20asc&rows=25&start=300

{
  "responseHeader":{
    "status":0,
    "QTime":22,
    "params":{
      "q":"*:*",
      "indent":"true",
      "fl":"id,desc1",
      "start":"300",
      "collection":"mdsearch_veevavault,mdsearch_hema",
      "sort":"titlesort desc, id asc",
      "rows":"25",
      "wt":"json"}},
  "response":{"numFound":6960,"start":300,"docs":[
      {}

Now numFound has decreased to 6960. Can anyone please help me understand what is causing this? I assumed that numFound would remain constant when the start parameter changes, but I see this variation every time I change start.


Solution

  • My guess is that this is caused by duplicate ids for records across the two collections. When Solr merges them into a single result, the ids should be unique, as that's how Solr knows that the documents are different.

    This happens because Solr only fetches enough documents from each shard/replica to satisfy the start+rows documents requested. For the first request, where rows takes its default of 10, each server returns 10 documents along with the total count of documents matching the query. The server answering the request then merges these counts, together with the lists of documents.

    At that point Solr can't know that there are n overlapping ids within the remaining, unfetched documents. Once you've paginated far enough into the result set, Solr looks at the ids in all the result sets returned from the shards, sees the duplicate ids, and removes them from the total count. If this is the cause, the drop from 6963 to 6960 means three duplicate ids were encountered among the documents fetched for start=300.

    You can solve this by making each id collection-specific (i.e. collectionname_idvalue as the actual value in id), unless you're comfortable with these results being merged.
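
To make the merge behaviour described above concrete, here is a minimal Python sketch of it. This is a simplified model, not Solr's actual code: the function names (`shard_response`, `merged_num_found`), the toy ids, and the `collA_`/`collB_` prefixes are all hypothetical, and each list stands in for one collection's matches already in sort order.

```python
def shard_response(ids, start, rows):
    """Model one shard's reply: its full match count, but only its top start+rows ids."""
    return {"numFound": len(ids), "ids": ids[: start + rows]}


def merged_num_found(shards, start=0, rows=10):
    """Sum the per-shard counts, then subtract one for every duplicate id
    that is actually visible among the ids the shards returned."""
    num_found = 0
    seen = set()
    for ids in shards:
        resp = shard_response(ids, start, rows)
        num_found += resp["numFound"]
        for doc_id in resp["ids"]:
            if doc_id in seen:
                num_found -= 1  # same id already returned by another shard
            else:
                seen.add(doc_id)
    return num_found


# Hypothetical data: 203 docs per collection, with three shared ids
# sitting at positions 100-102 of each collection's sort order.
dups = ["dup1", "dup2", "dup3"]
coll_a = [f"a{i}" for i in range(100)] + dups + [f"a{i}" for i in range(100, 200)]
coll_b = [f"b{i}" for i in range(100)] + dups + [f"b{i}" for i in range(100, 200)]

# First page: the duplicates are not among the fetched ids, so nothing is subtracted.
print(merged_num_found([coll_a, coll_b]))                      # 406
# Deeper page: both shards now return the shared ids, so the count drops by three.
print(merged_num_found([coll_a, coll_b], start=100, rows=25))  # 403

# Collection-specific ids (the suggested fix) keep the count stable.
prefixed_a = [f"collA_{i}" for i in coll_a]
prefixed_b = [f"collB_{i}" for i in coll_b]
print(merged_num_found([prefixed_a, prefixed_b], start=100, rows=25))  # 406
```

In this model the total depends on how many ids the coordinating node gets to compare, which is why it moves with start. With prefixed ids there is nothing to deduplicate, so the count stays the same on every page.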