solr cluster-analysis carrot2

Force or boost words in carrot2 clustering labels


I am using Carrot2 to cluster query results from Solr. Is it possible to force (or at least boost) the occurrence of certain words in the labels, in Lingo, STC, or k-means?

With Lingo, this is already possible with the option "Title word boost", which gives more weight to words appearing in the document title. Can this be extended to other words that I can provide?

I imagine it should at least be possible to append the desired words to the string that the "Title word boost" option reads, so that they get boosted as well, but maybe that is not the right approach.

What would be the way to do it?


Solution

  • Boosting arbitrary words is currently not exposed in the API, so only words that appear in the title can be promoted.

    The code that does the boosting is in:

    https://github.com/carrot2/carrot2/blob/master/core/carrot2-util-text/src/org/carrot2/text/vsm/TermDocumentMatrixBuilder.java#L159

    You could add another attribute that would, for example, take a comma-separated list of words and boost them too.
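
To make the idea of a comma-separated boost list concrete, here is a minimal standalone sketch. It is not Carrot2 code and does not use any Carrot2 API: the class `WordBoostSketch`, its methods, and the plain `double[][]` term-document matrix (one row per term, one column per document) are all hypothetical, chosen only to illustrate the approach. An actual change would have to be made inside `TermDocumentMatrixBuilder` using its own data structures.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Carrot2.
public class WordBoostSketch {

    /** Parses a comma-separated list into a set of trimmed, lower-cased words. */
    public static Set<String> parseBoostedWords(String commaSeparated) {
        Set<String> words = new HashSet<>();
        if (commaSeparated == null) {
            return words;
        }
        for (String raw : commaSeparated.split(",")) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /**
     * Returns a copy of the matrix in which every row whose term is in
     * the boosted set has been multiplied by the boost factor.
     * Row i of the matrix holds the weights of terms[i] in each document.
     */
    public static double[][] boostRows(String[] terms, double[][] matrix,
                                       Set<String> boosted, double boost) {
        double[][] result = new double[matrix.length][];
        for (int i = 0; i < matrix.length; i++) {
            result[i] = Arrays.copyOf(matrix[i], matrix[i].length);
            if (boosted.contains(terms[i].toLowerCase(Locale.ROOT))) {
                for (int j = 0; j < result[i].length; j++) {
                    result[i][j] *= boost;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"solr", "cluster", "search"};
        double[][] matrix = {{1, 0}, {2, 1}, {0, 3}};
        Set<String> boosted = parseBoostedWords("Cluster, label");
        // Only the "cluster" row is scaled; "label" is not in the matrix.
        System.out.println(Arrays.deepToString(boostRows(terms, matrix, boosted, 2.0)));
    }
}
```

Matching here is case-insensitive on the exact word; if the matrix is built from stemmed terms, the supplied words would presumably need the same stemming before the lookup.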