I am currently using the Snowball stemmer (Porter2) in my Java project to stem words. However, it stems words that don't need to be stemmed, or stems them too aggressively. For example: online -> onlin, why -> whi, raise -> rais, appreciate -> appreci.
Is there any way to prevent this unnecessary stemming? I would like it to give me meaningful words while still stemming words that need it, such as treating -> treat, records -> record, development -> develop, etc., perhaps by implementing some sort of dictionary that would stop certain words from being stemmed. Or is there another stemmer similar to Snowball that is less aggressive in its stemming?
Thanks for all the help.
Here is my function.
The main job of the Porter stemmer is to group words under a common stem. These stems are fine for its purpose, because Porter was designed for search: it doesn't matter whether a stem is a real word, what matters is that it is the same for the whole family of words.
Since your goal is term frequency analysis and collocations, I suppose you need a light stemmer or a minimal one.
You can check this article for the stemmers used in Lucene. In particular, note:
minimal_english: the EnglishMinimalStemmer in Lucene, which removes plurals.
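To give an idea of what such a minimal stemmer does, here is a rough standalone sketch in plain Java. It is not Lucene's code; it only follows the same general idea of stripping plural endings, and it adds a small protected-word set as one possible take on the "dictionary" you asked about. The class name LightStemmer and the contents of PROTECTED are made up for illustration.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: a plural-stripping stemmer with a protected-word list.
final class LightStemmer {

    // Hypothetical dictionary of words that must never be stemmed.
    private static final Set<String> PROTECTED = Set.of("news", "series", "this");

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        int len = w.length();

        // Protected words, very short words, and words not ending in 's' are returned as-is.
        if (PROTECTED.contains(w) || len < 3 || !w.endsWith("s")) {
            return w;
        }

        char prev = w.charAt(len - 2);
        if (prev == 'u' || prev == 's') {
            return w; // e.g. "bus", "class"
        }

        if (prev == 'e') {
            char third = w.charAt(len - 3);
            if (third == 'i' && len > 3) {
                char fourth = w.charAt(len - 4);
                if (fourth != 'a' && fourth != 'e') {
                    // "-ies" becomes "-y", e.g. "ponies" -> "pony"
                    return w.substring(0, len - 3) + "y";
                }
            }
            if (third == 'i' || third == 'a' || third == 'o' || third == 'e') {
                return w; // leave "-aes", "-ees", "-oes" and the remaining "-ies" cases alone
            }
        }

        // Default: drop the trailing 's', e.g. "records" -> "record".
        return w.substring(0, len - 1);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"online", "why", "raise", "appreciate", "records", "ponies", "news"}) {
            System.out.println(s + " -> " + stem(s));
        }
    }
}
```

With this sketch, online, why, raise and appreciate come back unchanged and records becomes record. The trade-off is that a plural-only stemmer will not reduce treating to treat or development to develop, so whether that is acceptable depends on how much conflation your analysis needs.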