Suppose I have the following text:
text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
I can calculate the PMI for bigrams using NLTK as follows:

import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)
which gives:
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
Now, to check my own understanding, I want to compute PMI('black', 'sheep') by hand. The PMI formula is given as:

PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )

There are 4 instances of 'black' in the text, there are 3 instances of 'sheep', and 'black' and 'sheep' occur together 3 times; the text is 23 tokens long. Following the formula, I do:
np.log((3/23)/((4/23)*(3/23)))
That gives 1.749199854809259 rather than 2.523561956057013. Why is there a discrepancy here? What am I missing?
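The counts used above can be sanity-checked with a quick tally; a minimal sketch using collections.Counter on the whitespace-split text:

```python
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()
bigrams = list(zip(tokens, tokens[1:]))

print(len(tokens))                           # total tokens: 23
print(Counter(tokens)["black"])              # 'black' count: 4
print(Counter(tokens)["sheep"])              # 'sheep' count: 3
print(Counter(bigrams)[("black", "sheep")])  # ('black', 'sheep') count: 3
```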
The discrepancy comes from the base of the logarithm: NLTK's PMI measure uses log base 2, while numpy.log, per NumPy's documentation, is the natural logarithm (base e), which is not what you want here.
The following gives you the expected result of 2.523561956057013:

import math
math.log((3/23)/((4/23)*(3/23)), 2)
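Putting it together, a minimal sketch that names the probabilities and applies the formula in base 2 (equivalently, math.log2 or numpy.log2 could be used):

```python
import math

# NLTK's BigramAssocMeasures.pmi works in log base 2; numpy.log is base e.
p_xy = 3 / 23   # p('black', 'sheep'): 3 co-occurrences out of 23 bigpositions
p_x = 4 / 23    # p('black'): 4 occurrences out of 23 tokens
p_y = 3 / 23    # p('sheep'): 3 occurrences out of 23 tokens

pmi = math.log(p_xy / (p_x * p_y), 2)
print(pmi)  # matches NLTK's 2.523561956057013 for ('black', 'sheep')
```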