java lucene morelikethis

What is the purpose of "fieldName" in Lucene MoreLikeThis.like(fieldName, reader)?


I was trying to "upgrade" this MoreLikeThis example to Lucene 5.2.1. I was able to make it run, but I don't understand the purpose of the argument fieldName of the method like(String fieldName, Reader... readers).

The documents were created and indexed as follows:

Document doc = new Document();
doc.add(new StringField("id", id, Store.YES));
doc.add(new Field("title", title, type));
doc.add(new Field("content", content, type));

The query was initialized as follows:

MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[] { "title", "content" });
Reader sReader = new StringReader(searchForSimilar);
Query query = mlt.like("title", sReader);

As I said, it worked as expected. Similar docs were properly retrieved and ranked. Since the API documentation doesn't explain the argument, I ran some experiments: instead of "title", I passed "content", "xxx" and null.

All of them returned the same documents, with the same score...

I tried to look inside the Lucene source: the argument is passed to addTermFrequencies, and then to analyzer.tokenStream(fieldName, r). After that, the code becomes too complex for me to follow...

So the argument seems to be "important", but as I said, it made no difference.

Does anyone know its purpose?


Solution

  • It's just for the analyzer.

    In order to query effectively, MLT needs to know how to tokenize your content. Calls to Analyzer.tokenStream must be passed a field name, because some analyzers need it.

    Many don't though. StandardAnalyzer, for instance, does not use that parameter (take a look at StandardAnalyzer.createComponents, and you'll see it never actually does anything with it). For StandardAnalyzer, and indeed most analyzers, in my experience, that argument could be anything. The field doesn't even have to exist.

    One analyzer that does use it is PerFieldAnalyzerWrapper. If you were using that, it would need the field name to determine which analyzer to apply.

    As far as I know, it isn't used for anything else. like(int docNum) does not require a field name because it works directly from the indexed term vectors, which are already analyzed.
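
To make the point above concrete, here is a minimal plain-Java sketch of the idea. It deliberately does not use Lucene: ToyAnalyzer, LOWERCASE_WORDS, WHOLE_TEXT and perField are hypothetical names invented for this illustration, and the tokenization is far cruder than anything a real analyzer does. It only shows why a field-agnostic analyzer gives the same tokens for "title", "xxx" or null, while a per-field wrapper uses the name as a lookup key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT a Lucene class.
class FieldNameDispatchSketch {

    /** Hypothetical stand-in for an analyzer: maps (fieldName, text) to tokens. */
    interface ToyAnalyzer {
        List<String> tokenize(String fieldName, String text);
    }

    /** Lowercases and splits on whitespace; never looks at fieldName. */
    static final ToyAnalyzer LOWERCASE_WORDS = (fieldName, text) -> {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    };

    /** Emits the whole text as a single token; also ignores fieldName. */
    static final ToyAnalyzer WHOLE_TEXT =
            (fieldName, text) -> Collections.singletonList(text);

    /** Picks a delegate by fieldName, falling back when the field is unknown. */
    static ToyAnalyzer perField(ToyAnalyzer fallback, Map<String, ToyAnalyzer> byField) {
        return (fieldName, text) ->
                byField.getOrDefault(fieldName, fallback).tokenize(fieldName, text);
    }

    public static void main(String[] args) {
        String text = "More Like This";

        // A field-agnostic analyzer: the name can be anything, even null.
        System.out.println(LOWERCASE_WORDS.tokenize("title", text)); // [more, like, this]
        System.out.println(LOWERCASE_WORDS.tokenize("xxx", text));   // [more, like, this]
        System.out.println(LOWERCASE_WORDS.tokenize(null, text));    // [more, like, this]

        // A per-field wrapper: here the name decides which delegate runs.
        Map<String, ToyAnalyzer> byField = new HashMap<>();
        byField.put("id", WHOLE_TEXT);
        ToyAnalyzer wrapper = perField(LOWERCASE_WORDS, byField);

        System.out.println(wrapper.tokenize("title", text)); // [more, like, this]
        System.out.println(wrapper.tokenize("id", text));    // [More Like This]
    }
}
```

This matches what the question observed: with an analyzer that ignores the field name, changing the first argument of like(fieldName, reader) cannot change the extracted terms, so the resulting documents and scores stay the same.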