solr lucene morelikethis

Term vectors in Solr


I'm trying to use Solr's MoreLikeThis feature to find documents similar to a given document, but I don't quite understand how some of this functionality works.

As it says here, the MoreLikeThis component works best when termVectors are stored. This is where my confusion comes in.

Is it enough to enable the termVectors flag on a field (let's say the field contains a movie review text) in Solr's schema.xml file? Will that make Solr calculate the term vectors for that field when a document is inserted, store them, and then use the calculated term vectors in subsequent calls to the MoreLikeThis handler?


Solution

  • Short answer: no, enabling the flag alone is not enough; you also need to re-index after such a schema change. Having term vectors enabled speeds up the first phase, which extracts the interesting terms from the original input document (if that document is in the index). The timing of the second phase, when the More Like This query itself is executed, remains the same. For more information about how MLT works, see [1].

    In general, when applying such changes to the schema, you need to re-index your documents so that Solr builds the related data structures. The term vector is a mini index per document and requires specific files to be stored on disk [2]. N.B. this will increase your disk utilisation.

    [1] https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene

    [2] https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene50/Lucene50TermVectorsFormat.html
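
Once the field has termVectors="true" in schema.xml and the documents have been re-indexed, a MoreLikeThis request can be built roughly as in the sketch below. It only assembles the request URL. It assumes a MoreLikeThisHandler is registered at /mlt in solrconfig.xml, and the collection name "movies", the field name "review", and the unique key "id" are hypothetical placeholders, not taken from the question.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, collection, doc_id, fields, min_tf=1, min_df=1, rows=5):
    """Build a URL for a MoreLikeThis request against a single source document.

    Assumptions (not from the original post): the handler path is /mlt and
    the unique key field is called "id".
    """
    params = {
        # Query that selects the source document to find similar documents for.
        "q": f"id:{doc_id}",
        # Fields to take the interesting terms from; these are the fields
        # that benefit from termVectors="true" in schema.xml.
        "mlt.fl": ",".join(fields),
        # Minimum term frequency in the source document for a term to be used.
        "mlt.mintf": min_tf,
        # Minimum number of documents a term must appear in to be used.
        "mlt.mindf": min_df,
        # Number of similar documents to return.
        "rows": rows,
    }
    return f"{base_url}/{collection}/mlt?{urlencode(params)}"


# Hypothetical local Solr instance and collection.
url = build_mlt_url("http://localhost:8983/solr", "movies", "42", ["review"])
print(url)
```

The request returns results whether or not term vectors are stored. Without them, Solr has to work out the source document's terms another way (for example by re-analysing the stored field value), which is the slower first phase described above.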