java indexing lucene clustered-index carrot2

Carrot2 shows no clusters when searching for a word that every document contains


I selected some rows from my database that all contain a specific word, such as StackOverFlow, and saved them in a text file. Then I used Lucene to index the file contents.

When I search the index for StackOverFlow using Carrot2, it returns no documents. For other words that I know exist in at least one document, it does return some of them.

The Carrot2 documentation describes an attribute called Maximum word document frequency:

Maximum word document frequency. The maximum document frequency allowed for words as a fraction of all documents. Words with document frequency larger than maxWordDf will be ignored. For example, when maxWordDf is 0.4, words appearing in more than 40% of documents will be ignored. A value of 1.0 means that all words will be taken into account, no matter in how many documents they appear. This attribute may be useful when certain words appear in most of the input documents (e.g. company name from header or footer) and such words dominate the cluster labels. In such case, setting maxWordDf to a value lower than 1.0, e.g. 0.9 may improve the clusters.

Another useful application of this attribute is when there is a need to generate only very specific clusters, i.e. clusters containing small numbers of documents. This can be achieved by setting maxWordDf to extremely low values, e.g. 0.1 or 0.05.

However, setting maxWordDf to 1.0 changes nothing: the search still returns no documents.

How do I resolve this problem?


Solution

  • Documents usually go missing from search results because the analyzer used to index them differs from the analyzer Carrot2 uses at search time. By default, Carrot2 uses Lucene's StandardAnalyzer; you can provide a different analyzer through the LuceneDocumentSource.analyzer attribute.
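
To see why an analyzer mismatch makes a term unfindable, here is a minimal toy model in plain Java. It does not use Lucene or Carrot2 classes: the two "analyzers" and the `AnalyzerMismatchDemo` class are hypothetical stand-ins, and the choice of a case-preserving analyzer at index time is an assumption made only for illustration. Your actual index-time analyzer may differ in other ways.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Toy model only: these "analyzers" are hypothetical stand-ins, not Lucene classes.
final class AnalyzerMismatchDemo {

    // Splits on non-alphanumeric characters and keeps the original case.
    static List<String> caseSensitive(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{Alnum}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Same split, but lowercases every token, as a lowercasing analyzer would.
    static List<String> lowercasing(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : caseSensitive(text)) {
            tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Builds a term -> document ids map using the given index-time analyzer.
    static Map<String, Set<Integer>> index(List<String> docs,
                                           Function<String, List<String>> analyzer) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Returns ids of documents matching any query term under the query-time analyzer.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query,
                               Function<String, List<String>> analyzer) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyzer.apply(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("StackOverFlow is useful", "Ask on StackOverFlow");
        Map<String, Set<Integer>> postings = index(docs, AnalyzerMismatchDemo::caseSensitive);

        // Mismatch: the query becomes "stackoverflow", but the index holds "StackOverFlow".
        System.out.println(search(postings, "StackOverFlow", AnalyzerMismatchDemo::lowercasing));   // prints []

        // Same analyzer on both sides: both documents are found.
        System.out.println(search(postings, "StackOverFlow", AnalyzerMismatchDemo::caseSensitive)); // prints [0, 1]
    }
}
```

Real analyzers differ in more than case (stop words, stemming, token splitting), but the effect is the same: a term produced at query time that never appears in the index matches nothing, so the same analyzer has to be used on both sides.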