[SOLVED] Modifying stop words list

I would like to tune the Carrot2 clusters to avoid labels that do not start with prepositions -- in Russian, it looks quite strange to see a word in a non-nominative grammatical case without its preposition.

The clustering is done using Apache Solr.

Examples:

Минске ([in] Minsk, missing the preposition В at the beginning).
Самом Деле ([in] fact, missing the preposition На at the beginning).

I have tried two independent things:

Configure core/clustering/carrot2/stopwords.ru and remove the prepositions in question from it.
Unpack carrot2-mini-3.9.0.jar, remove the entries from stopwords.ru, and pack it back into the jar.

Neither of the above has any effect on the cluster labels. Is there some other obvious thing to try? Or should I perhaps change the tuning approach altogether?

Thank you!

Solution

Removing prepositions from the stop words files should do the trick. Even with the modified stop words files, the prepositions can still be missing because of the statistics of the data -- if some occurrences of Минске are preceded by "in" (В) and others are not, the algorithm may pick the shorter version (without the preposition) as the more representative one.

The entries in core/clustering/carrot2/stopwords.ru should take precedence over those in the stopwords.ru bundled in carrot2-mini-3.9.0.jar.
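To rule out a leftover entry, it may help to check which prepositions are still listed in the stop words file that Solr actually reads. The following is only a rough sketch: the helper names and the preposition list are made up for illustration, and it assumes the file is UTF-8 with one word per line and '#' marking comment lines.

```python
from pathlib import Path

# Hypothetical list of prepositions that should be allowed in labels.
PREPOSITIONS = {"в", "на"}


def remaining_stop_words(lines, words):
    """Return, sorted, the members of `words` still listed in the given lines."""
    listed = set()
    for line in lines:
        entry = line.strip().lower()
        # Assumption: blank lines and lines starting with '#' are not entries.
        if entry and not entry.startswith("#"):
            listed.add(entry)
    return sorted(w for w in words if w.lower() in listed)


def check_file(path, words=PREPOSITIONS):
    """Read a stop words file and report which of `words` it still contains."""
    text = Path(path).read_text(encoding="utf-8")
    return remaining_stop_words(text.splitlines(), words)
```

Calling check_file("core/clustering/carrot2/stopwords.ru") should return an empty list once the prepositions have really been removed from that file.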

When it comes to the Lingo clustering algorithm, there is no direct way to control the number of words per label, but you can try increasing the phrase label boost and lowering the truncated label threshold.

A complete list of clustering algorithm parameters is in the Carrot2 documentation. You can pass parameter overrides as part of Solr results clustering requests.
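As a rough sketch of passing such overrides, the snippet below builds a clustering request URL. Several things here are assumptions to check against your setup and the Carrot2 attribute reference: the host, the core name (mycore), the /clustering handler path, the attribute keys LingoClusteringAlgorithm.phraseLabelBoost and CompleteLabelFilter.labelOverrideThreshold (as I recall them from the Carrot2 3.x documentation for the two settings mentioned above), and the values, which are purely illustrative.

```python
from urllib.parse import urlencode


def clustering_url(base_url, query, overrides):
    """Build a Solr clustering request URL with Carrot2 attribute overrides."""
    # Assumption: the handler accepts clustering=true to enable the component.
    params = {"q": query, "wt": "json", "clustering": "true"}
    params.update(overrides)
    return base_url + "?" + urlencode(params)


url = clustering_url(
    "http://localhost:8983/solr/mycore/clustering",  # hypothetical core and handler
    "Минск",
    {
        # Assumed attribute keys; the values are illustrative, not tuned.
        "LingoClusteringAlgorithm.phraseLabelBoost": "2.5",
        "CompleteLabelFilter.labelOverrideThreshold": "0.3",
    },
)
print(url)
```

If the overrides are picked up, changing the values should visibly change the labels for the same query; if nothing changes, the keys or the handler configuration are the first things to verify.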