Tags: python, nltk, stemming, lemmatization

Comparison between stemming and lemmatization


Based on several pieces of research, I found the following important comparative analysis:

[comparative analysis]

If we look at texts, lemmatization should presumably return the more correct output, right? Not only correct, but also a shortened version. I ran an experiment on this line:

        sentence = "having playing  in today gaming ended with greating victorious"

But when I ran the code for both the stemmer and the lemmatizer, I got the following results:

        ['have', 'play', 'in', 'today', 'game', 'end', 'with', 'great', 'victori']
        ['having', 'playing', 'in', 'today', 'gaming', 'ended', 'with', 'greating', 'victorious']

The first one is stemming, and everything looks fine except victori (it should be victory, right?), and the second one is lemmatization (all of the words are correct, but in their original form). So in this case, which option is good: the short but mostly incorrect version, or the long but correct version? (A note on why the lemmatizer leaves these words unchanged follows the code below.)

        import nltk
        from nltk.tokenize import word_tokenize, sent_tokenize
        from nltk.corpus import stopwords
        from sklearn.feature_extraction.text import CountVectorizer
        from nltk.stem import PorterStemmer, WordNetLemmatizer

        # Download the data these tools rely on: punkt for tokenization,
        # wordnet for the lemmatizer, stopwords for the stop-word list
        nltk.download('punkt')
        nltk.download('wordnet')
        nltk.download('stopwords')

        mylematizer = WordNetLemmatizer()
        mystemmer = PorterStemmer()
        sentence = "having playing  in today gaming ended with greating victorious"
        words = word_tokenize(sentence)
        # print(words)
        stemmed = [mystemmer.stem(w) for w in words]
        lematized = [mylematizer.lemmatize(w) for w in words]
        print(stemmed)
        print(lematized)

        # mycounter = CountVectorizer()
        # mysentence = "i love ibsu. because ibsu is great university"
        # # print(word_tokenize(mysentence))
        # # print(sent_tokenize(mysentence))
        # individual_words = word_tokenize(mysentence)
        # stops = list(stopwords.words('english'))
        # words = [w for w in individual_words if w not in stops and w.isalnum()]
        # reduced = [mystemmer.stem(w) for w in words]

        # new_sentence = ' '.join(words)
        # frequencies = mycounter.fit_transform([new_sentence])
        # print(frequencies.toarray())
        # print(mycounter.vocabulary_)
        # print(mycounter.get_feature_names_out())
        # print(new_sentence)
        # print(words)
        # # print(list(stopwords.words('english')))
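
A quick note on why the lemmatized list comes back almost unchanged: WordNetLemmatizer.lemmatize defaults to treating every word as a noun (pos='n'), so verb forms pass through untouched unless you tell it the part of speech. A minimal sketch, assuming the wordnet data is downloaded:

        from nltk.stem import WordNetLemmatizer

        wnl = WordNetLemmatizer()
        # With the default noun POS, a verb form passes through unchanged
        print(wnl.lemmatize("ended"))       # ended
        # Declaring it a verb yields the base form
        print(wnl.lemmatize("ended", "v"))  # end

The solution below shows how to supply that part of speech automatically via pos_tag.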

Solution

  • Here is an example of what parts of speech the lemmatizer is using for the words in your string:

    import nltk
    nltk.download('wordnet')                      # data for the WordNet lemmatizer
    nltk.download('punkt')                        # data for word_tokenize
    nltk.download('averaged_perceptron_tagger')   # data for pos_tag
    from nltk.corpus import wordnet
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk import word_tokenize, pos_tag
    from collections import defaultdict
    
    # Map the first letter of each Penn Treebank tag to a WordNet POS,
    # defaulting to noun for anything unrecognized
    tag_map = defaultdict(lambda: wordnet.NOUN)
    tag_map['J'] = wordnet.ADJ
    tag_map['V'] = wordnet.VERB
    tag_map['R'] = wordnet.ADV
    
    sentence = "having playing in today gaming ended with greating victorious"
    tokens = word_tokenize(sentence)
    wnl = WordNetLemmatizer()
    for token, tag in pos_tag(tokens):
        print('found tag', tag[0])
        lemma = wnl.lemmatize(token, tag_map[tag[0]])
        print(token, "lemmatized to", lemma)
    

    The output:

    found tag V
    having lemmatized to have
    found tag N
    playing lemmatized to playing
    found tag I
    in lemmatized to in
    found tag N
    today lemmatized to today
    found tag N
    gaming lemmatized to gaming
    found tag V
    ended lemmatized to end
    found tag I
    with lemmatized to with
    found tag V
    greating lemmatized to greating
    found tag J
    victorious lemmatized to victorious
    

    Lemmatization distills words to their foundational form. It is similar to stemming, but it brings context to the words, linking words with similar meanings to one word. The fancy linguistic word is "morphology": how do the words of a given language relate to each other? If you look at the output above, the -ing verbs are being parsed as nouns. -ing verbs, while verbs, can also be used as nouns: in "I love swimming", the verb is love and the noun is swimming. That is how the tags are being interpreted above. And, to be honest, your sentence above is not really a sentence at all, which is why the tagger struggles with it. I would not say one approach is correct over the other; rather, consider lemmatization the more powerful option when parts of speech are used correctly in a sentence that has an independent clause, or dependent clauses along with an independent one.
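
    To see the effect of the part of speech directly, here is a minimal sketch of how the supplied POS changes the lemma (assuming the wordnet corpus is downloaded):

    from nltk.corpus import wordnet
    from nltk.stem.wordnet import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    # As a noun (the lemmatizer's default), the gerund is already its own lemma
    print(wnl.lemmatize("swimming", wordnet.NOUN))  # swimming
    # As a verb, it reduces to the base form
    print(wnl.lemmatize("swimming", wordnet.VERB))  # swim
    print(wnl.lemmatize("gaming", wordnet.VERB))    # game
    print(wnl.lemmatize("playing", wordnet.VERB))   # play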