I am new to Python and NLTK, so please bear with me. I want to find the sense of a word in the context of a sentence. I am using the Lesk WSD algorithm, but it gives different outputs every time I run it. I know that Lesk has some level of inaccuracy, but I think a POS tag would increase accuracy.
The Lesk algorithm takes a POS tag as an argument, but it expects 'n', 's', 'v' and the like, not the 'NN', 'VBP', or other tags output by the pos_tag() function. I would like to know how to tag words as 'n', 's', 'v', or whether there is a way to convert the 'NN', 'VBP', and other tags into 'n', 's', 'v', so I can pass them to the lesk(context_sentence, word, pos_tag) function.
I am calculating the sentiment score of every word using SentiWordNet afterwards.
from nltk.wsd import lesk
from nltk import word_tokenize
import nltk, re, pprint
from nltk.corpus import sentiwordnet as swn

def word_sense():
    sent = word_tokenize("He should be happy.")
    word = "be"
    pos = "v"
    score = lesk(sent, word, pos)
    print(score)
    print(str(score), type(score))
    set1 = re.findall("'([^']*)'", str(score))[0]
    print(set1)
    bank = swn.senti_synset(str(set1))
    print(bank)

word_sense()
nltk.wsd.lesk does not return a score; it returns the predicted Synset:
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import sentiwordnet as swn
>>> from nltk import word_tokenize
>>> from nltk.wsd import lesk
>>> sent = word_tokenize("He should be happy".lower())
>>> lesk(sent, 'be', 'v')
Synset('equal.v.01')
lesk is not perfect; it should only be used as a baseline system for WSD.
Although extracting the synset name with a regex, as in your code, works:
>>> ss = str(lesk(sent, 'be', 'v'))
>>> re.findall("'([^']*)'",ss)
['equal.v.01']
There's a simpler way to get the synset identifier:
>>> lesk(sent, 'be', 'v').name()
u'equal.v.01'
Then you can do:
>>> swn.senti_synset(lesk(sent, 'be', 'v').name())
SentiSynset('equal.v.01')
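From that SentiSynset you can read off the sentiment scores with pos_score(), neg_score() and obj_score(); a minimal sketch (the actual values depend on your installed SentiWordNet data):
ss = swn.senti_synset(lesk(sent, 'be', 'v').name())
# positive, negative and objectivity scores from SentiWordNet
print(ss.pos_score(), ss.neg_score(), ss.obj_score())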
To convert a POS tag from pos_tag() (e.g. 'NN', 'VBP') into a WordNet POS ('n', 'v', 'a', 'r'), you can simply try: Converting POS tags from TextBlob into Wordnet compatible inputs
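For example, a minimal sketch of such a conversion along the lines of the linked answer (the penn_to_wn helper name is just for illustration, not part of NLTK): map the first letter of the Penn Treebank tag to the WordNet POS constants and pass the result to lesk:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def penn_to_wn(tag):
    # Map Penn Treebank tags ('NN', 'VBP', 'JJ', 'RB', ...) to WordNet POS letters.
    if tag.startswith('J'):
        return wn.ADJ   # 'a'
    elif tag.startswith('V'):
        return wn.VERB  # 'v'
    elif tag.startswith('N'):
        return wn.NOUN  # 'n'
    elif tag.startswith('R'):
        return wn.ADV   # 'r'
    return None         # lesk() accepts pos=None and then considers all POS

sent = word_tokenize("He should be happy.")
for word, tag in pos_tag(sent):
    print(word, tag, lesk(sent, word, penn_to_wn(tag)))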