Based on several pieces of research, I put together the following comparative analysis:
If we look at texts, lemmatization should return the more correct output, right? Not only correct, but also a shortened version. I ran an experiment on this line:
sentence = "having playing in today gaming ended with greating victorious"
But when I ran the code for both the stemmer and the lemmatizer, I got the following results:
['have', 'play', 'in', 'today', 'game', 'end', 'with', 'great', 'victori']
['having', 'playing', 'in', 'today', 'gaming', 'ended', 'with', 'greating', 'victorious']
The first is stemming, and everything looks fine except victori (it should be victory, right?). The second is lemmatization, where all the words are correct but left in their original form. So which option is better in this case: the short but sometimes incorrect version, or the long but correct one?
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('stopwords')

mylematizer = WordNetLemmatizer()
mystemmer = PorterStemmer()

sentence = "having playing in today gaming ended with greating victorious"
words = word_tokenize(sentence)
# print(words)

stemmed = [mystemmer.stem(w) for w in words]
lematized = [mylematizer.lemmatize(w) for w in words]
print(stemmed)
print(lematized)
# mycounter =CountVectorizer()
# mysentence ="i love ibsu. because ibsu is great university"
# # print(word_tokenize(mysentence))
# # print(sent_tokenize(mysentence))
# individual_words=word_tokenize(mysentence)
# stops =list(stopwords.words('english'))
# words =[w for w in individual_words if w not in stops and w.isalnum() ]
# reduced =[mystemmer.stem(w) for w in words]
# new_sentence =' '.join(words)
# frequencies =mycounter.fit_transform([new_sentence])
# print(frequencies.toarray())
# print(mycounter.vocabulary_)
# print(mycounter.get_feature_names_out())
# print(new_sentence)
# print(words)
# # print(list(stopwords.words('english')))
Here is an example of what parts of speech the lemmatizer is using for the words in your string:
import nltk
nltk.download('wordnet')                       # lexical database for the lemmatizer
nltk.download('punkt')                         # tokenizer models for word_tokenize
nltk.download('averaged_perceptron_tagger')    # tagger used by pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

# Map the first letter of the Penn Treebank tag to a WordNet POS; default to noun
tag_map = defaultdict(lambda: wordnet.NOUN)
tag_map['J'] = wordnet.ADJ
tag_map['V'] = wordnet.VERB
tag_map['R'] = wordnet.ADV

sentence = "having playing in today gaming ended with greating victorious"
tokens = word_tokenize(sentence)
wnl = WordNetLemmatizer()
for token, tag in pos_tag(tokens):
    print('found tag', tag[0])
    lemma = wnl.lemmatize(token, tag_map[tag[0]])
    print(token, "lemmatized to", lemma)
The output:
found tag V
having lemmatized to have
found tag N
playing lemmatized to playing
found tag I
in lemmatized to in
found tag N
today lemmatized to today
found tag N
gaming lemmatized to gaming
found tag V
ended lemmatized to end
found tag I
with lemmatized to with
found tag V
greating lemmatized to greating
found tag J
victorious lemmatized to victorious
Lemmatization distills words to their base (dictionary) form. It is similar to stemming, but it brings context to the words, linking words with similar meanings to a single lemma. The fancy linguistic term is "morphology": how words relate to each other in a given language.

If you look at the output above, the -ing words are being tagged as nouns. -ing forms, while derived from verbs, can also be used as nouns: in "I love swimming", the verb is love and the noun is swimming. That is exactly how the tagger is interpreting them here, and because the fallback WordNet POS in the tag map is a noun, the lemmatizer leaves "playing" and "gaming" unchanged. To be honest, your example string is not a grammatical sentence at all.

I would not say one approach is correct over the other, but consider lemmatization the more powerful option when the parts of speech are used correctly, i.e., in a real sentence with an independent clause (and possibly dependent clauses as well).
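To illustrate how much the POS argument matters, here is a minimal sketch (not part of your original code) that calls the lemmatizer with and without an explicit verb POS; the expected results in the comments are what WordNet typically returns:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
wnl = WordNetLemmatizer()

# Default POS is noun, so the -ing forms come back unchanged
print(wnl.lemmatize("playing"))                # expected: playing
print(wnl.lemmatize("gaming"))                 # expected: gaming

# With the verb POS supplied explicitly, the lemmatizer reduces them
print(wnl.lemmatize("playing", wordnet.VERB))  # expected: play
print(wnl.lemmatize("gaming", wordnet.VERB))   # expected: game
print(wnl.lemmatize("ended", wordnet.VERB))    # expected: end

That is why the POS-aware loop above is the fairer comparison against the stemmer: with the right tags, lemmatization gives you forms that are both short and real dictionary words.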