Lucene comparing document contents


I am trying to compare the contents of documents using Solr. I do this by simply using the entire document contents as a query. This works until the documents get large: a document can contain 15k words or more, and the query then fails with a max boolean clause exception (the default limit is 1024 clauses). I could of course raise the limit, but even at 5k it would still be impossible to compare documents with large contents.

Is Lucene even suitable for such tasks? If so, what should I do to meet this requirement? If not, what would be an alternative way of comparing the contents of one document with other documents?


Solution

  • I think MoreLikeThis is what you want. MoreLikeThis prunes a document's contents down to its most significant terms (terms that occur often in the document but are not common across the whole index) and searches with only those. That avoids the large number of query clauses and also improves performance. If you are searching for documents similar to an external source:

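    // Signatures as in Lucene 4.x; in 5.x and later the call is mlt.like("contents", someReader).
    MoreLikeThis mlt = new MoreLikeThis(indexreader);
    mlt.setAnalyzer(analyzer); // "analyzer" is assumed here: the Analyzer used at index time
    Query query = mlt.like(someReader, "contents");
    TopDocs hits = indexsearcher.search(query, 10); // the old Hits class was removed in Lucene 3.0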
    

    Or, if you are searching for documents similar to one that is already in the index:

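    MoreLikeThis mlt = new MoreLikeThis(indexreader);
    // documentNumber is Lucene's internal document id, not your own key field.
    Query query = mlt.like(documentNumber);
    // The source document itself usually comes back as the top hit.
    TopDocs hits = indexsearcher.search(query, 10);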
    

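    Solr also includes a MoreLikeThis handler (and a MoreLikeThis search component), so you can get the same behavior over HTTP without writing Lucene code yourself.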
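
    To make the pruning idea concrete, here is a minimal, library-free sketch of selecting the top-scoring terms of a document before querying. It is not Lucene's implementation: the class name `TermPruner`, the whitespace-style tokenizer, and the simplified idf formula are all assumptions made for illustration. The real MoreLikeThis also applies thresholds such as minimum term frequency and minimum document frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of Lucene or Solr. */
final class TermPruner {

    /** Lowercases and splits on non-alphanumerics; a stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns up to maxTerms terms of doc, ranked by tf * idf against the corpus. */
    static List<String> topTerms(String doc, List<String> corpus, int maxTerms) {
        // Term frequency inside the document we want to compare.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(doc)) {
            tf.merge(t, 1, Integer::sum);
        }

        // Document frequency: in how many corpus documents each term appears.
        Map<String, Integer> df = new HashMap<>();
        for (String other : corpus) {
            Set<String> unique = new HashSet<>(tokenize(other));
            for (String t : unique) {
                df.merge(t, 1, Integer::sum);
            }
        }

        // Simplified idf (an assumption): terms present in every document score 0.
        int n = corpus.size();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            double idf = Math.log((n + 1.0) / (d + 1.0));
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties broken alphabetically.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }
}
```

    Calling `TermPruner.topTerms(text, corpus, 25)` on a 15k-word document would yield at most 25 terms, so a query built from them stays far below the 1024-clause limit regardless of document length.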