java-7 stanford-nlp eclipse-3.4 lemmatization

Stanford CoreNLP returning wrong results


I am trying lemmatization with Stanford CoreNLP, following this question. My environment is:

My code snippet is:

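    // Imports needed at the top of the file:
    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    //...........lemmatization starts........................

    // Build a pipeline that tokenizes, splits sentences, POS-tags and lemmatizes.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    // Second argument (enforceRequirements) is false, as in the original snippet.
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
    String text = "painting";
    Annotation document = pipeline.process(text);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    for (CoreMap sentence : sentences)
    {
        for (CoreLabel token : sentence.get(TokensAnnotation.class))
        {
            String word = token.get(TextAnnotation.class);    // surface form of the token
            String lemma = token.get(LemmaAnnotation.class);  // lemma chosen for that token
            System.out.println("lemmatized version :" + lemma);
        }
    }

    //...........lemmatization ends.........................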

The output I get is:

lemmatized version :painting

whereas I expect

lemmatized version :paint

Please enlighten me.


Solution

  • The problem in this example is that the word "painting" can be either the present participle of "to paint" or a noun, and the output of the lemmatizer depends on the part-of-speech tag assigned to the original word.

    If you run the tagger on the fragment "painting" alone, there is no context that could help the tagger (or a human) decide how the word should be tagged. In this case it picked the tag NN, and the lemma of the noun "painting" is in fact "painting".

    If you run the same code with the sentence "I am painting a flower." the tagger should correctly tag painting as VBG and the lemmatizer should return paint.
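
To make the tag dependence concrete, here is a minimal, self-contained sketch. It is a toy and not CoreNLP's lemmatizer: the class name `PosAwareLemmaDemo`, the `lemmatize` helper and the two-entry lookup table are all hypothetical, made up only to show that the same surface form can map to different lemmas depending on the tag it was given.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: NOT how CoreNLP lemmatizes internally.
// The class name, the helper and the table entries are hypothetical.
final class PosAwareLemmaDemo {

    // Lemmas keyed by "word/TAG": one surface form, different lemmas per tag.
    private static final Map<String, String> LEMMAS = new HashMap<String, String>();
    static {
        LEMMAS.put("painting/NN", "painting"); // noun, as in "a painting"
        LEMMAS.put("painting/VBG", "paint");   // present participle of "to paint"
    }

    // Returns the lemma for (word, tag); falls back to the word itself if unknown.
    static String lemmatize(String word, String posTag) {
        String lemma = LEMMAS.get(word + "/" + posTag);
        return lemma != null ? lemma : word;
    }

    public static void main(String[] args) {
        System.out.println("NN  -> " + lemmatize("painting", "NN"));  // NN  -> painting
        System.out.println("VBG -> " + lemmatize("painting", "VBG")); // VBG -> paint
    }
}
```

In the real pipeline you can check which tag the tagger actually chose by also printing `token.get(PartOfSpeechAnnotation.class)` (from `edu.stanford.nlp.ling.CoreAnnotations`) next to the lemma inside the token loop.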