[SOLVED] The reverse process of stemming

I use a Lucene Snowball analyzer to perform stemming, but the results are not meaningful words. I referred to this question.

One of the solutions is to use a database that maps the stemmed version of a word to one stable version of that word. (For example, from communiti to community, no matter which word communiti was derived from: communities or some other word.)

I want to know if there is a database which performs such a function.

Solution

It is theoretically impossible to recover a specific word from a stem, since one stem can be common to many words. One possibility, depending on your application, would be to build a database of stems each mapped to an array of several words. But you would then need to predict which one of those words is appropriate given a stem to re-convert.
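
A minimal sketch of such a stem-to-words database, assuming an in-memory map is enough for illustration. The class name StemIndex and the toyStem function are hypothetical; toyStem is only a stand-in that strips a few suffixes, and in practice you would pass in whatever stemmer you actually use (for example, the Snowball one) so that the keys match its output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class StemIndex {
    // Groups every vocabulary word under the stem the given stemmer produces for it.
    public static Map<String, List<String>> build(Collection<String> vocabulary,
                                                  Function<String, String> stemmer) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String word : vocabulary) {
            index.computeIfAbsent(stemmer.apply(word), k -> new ArrayList<>()).add(word);
        }
        return index;
    }

    // Hypothetical stand-in for a real stemmer (assumption, not Snowball's algorithm).
    public static String toyStem(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "i";
        if (word.endsWith("y")) return word.substring(0, word.length() - 1) + "i";
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("communities", "community", "runs", "run");
        Map<String, List<String>> index = build(vocabulary, StemIndex::toyStem);
        // Each stem maps to every word that produced it.
        System.out.println(index.get("communiti"));
        System.out.println(index.get("run"));
    }
}
```

The index only returns the candidates for a stem; choosing one of them still needs extra information about the context.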

As a very naive solution to this problem, if you know the word tags, you could try storing words with the tags in your database:

run:
   NN:  runner
   VBG: running
   VBZ: runs

Then, given the stem "run" and the tag "NN", you could determine that "runner" is the most probable word in that context. Of course, that solution is far from perfect. Notably, you'd need to handle the fact that the same word form might be tagged differently in different contexts. But remember that any attempt to solve this problem will be, at best, an approximation.
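
The tag-keyed lookup described above could be sketched roughly as follows, assuming one stored word per (stem, tag) pair. The class TaggedStemLookup and its methods are hypothetical names used for illustration, and the entries are the ones from the example table.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class TaggedStemLookup {
    // stem -> (part-of-speech tag -> word form)
    private final Map<String, Map<String, String>> table = new HashMap<>();

    public void add(String stem, String tag, String word) {
        table.computeIfAbsent(stem, k -> new HashMap<>()).put(tag, word);
    }

    // Returns the stored word for this stem and tag, or empty if none is known.
    public Optional<String> lookup(String stem, String tag) {
        return Optional.ofNullable(
                table.getOrDefault(stem, Collections.emptyMap()).get(tag));
    }

    public static void main(String[] args) {
        TaggedStemLookup db = new TaggedStemLookup();
        db.add("run", "NN", "runner");
        db.add("run", "VBG", "running");
        db.add("run", "VBZ", "runs");

        // Fall back to the bare stem when no entry matches.
        System.out.println(db.lookup("run", "NN").orElse("run"));  // prints "runner"
        System.out.println(db.lookup("run", "JJ").orElse("run"));  // prints "run"
    }
}
```

Storing a single word per tag is a simplification; if several words share a stem and a tag, the inner map would need to hold a list and some ranking of the candidates.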

Edit: from the comments below, it looks like you probably want to use lemmatization instead of stemming. Here's how to get the lemmas of words using the Stanford CoreNLP tools:

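import java.util.*;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class LemmaExample {
    public static void main(String[] args) {
        // The lemma annotator needs tokenize, ssplit, and pos to run first.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

        String text = "Hello, world!";
        Annotation document = pipeline.process(text);

        // Walk each sentence, then each token, and read its surface form and lemma.
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word + " -> " + lemma);
            }
        }
    }
}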