I was trying to complete an NLP assignment using the Jaccard Distance metric function jaccard_distance() built into nltk.metrics.distance, when I noticed that the results it returned did not make sense in the context I expected.
When I examined the implementation of jaccard_distance() in the online source, I noticed that it was not consistent with the mathematical definition of the Jaccard index. Specifically, the implementation in nltk is:
return (len(label1.union(label2)) - len(label1.intersection(label2)))/len(label1.union(label2))
but according to the definition, the numerator should involve only the intersection of the two sets, which means the correct implementation should be:
return len(label1.intersection(label2))/len(label1.union(label2))
When I wrote my own function using the latter, I indeed obtained correct answers on my assignment. For example, I was tasked with recommending a correct spelling for the misspelled word cormulent, drawn from a comprehensive corpus of words (built into nltk), using Jaccard Distance on trigrams of the words.
When I used the jaccard_distance() from nltk, I instead obtained so many perfect matches (the distance function returned 1.0) that were nowhere near correct.
When I used my own function with the latter implementation, I was able to get a spelling recommendation of corpulent, at a Jaccard Distance of 0.4 from cormulent, a decent recommendation.
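For reference, here is a minimal sketch of the kind of comparison I ran (the trigram sets come from nltk.util.ngrams; the helper name jaccard_similarity is mine):

from nltk.util import ngrams

def jaccard_similarity(label1, label2):
    # Jaccard similarity: |intersection| / |union|
    return len(label1.intersection(label2)) / len(label1.union(label2))

trigrams1 = set(ngrams("cormulent", 3))  # character trigrams of the misspelling
trigrams2 = set(ngrams("corpulent", 3))  # character trigrams of a candidate
print(jaccard_similarity(trigrams1, trigrams2))  # 0.4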
Could there be a bug in jaccard_distance() in nltk?
The two formulae you quote do not do the same thing, but they are mathematically related. The first definition you quote from the NLTK package is called the Jaccard distance (D_Jaccard). The second one you quote is called the Jaccard similarity (Sim_Jaccard).
Mathematically, D_Jaccard = 1 - Sim_Jaccard. The intuition is that the more similar the two sets are (the higher Sim_Jaccard is), the lower the distance between them (and hence D_Jaccard) becomes.
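A quick sketch to check this on the example from the question (building the character trigram sets with nltk.util.ngrams):

from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

trigrams1 = set(ngrams("cormulent", 3))
trigrams2 = set(ngrams("corpulent", 3))

# Hand-rolled Jaccard similarity, as in the question
similarity = len(trigrams1 & trigrams2) / len(trigrams1 | trigrams2)

# NLTK's built-in returns the Jaccard *distance*
distance = jaccard_distance(trigrams1, trigrams2)

print(similarity)      # 0.4
print(distance)        # 0.6
print(1 - similarity)  # 0.6, i.e. the distance NLTK reports

So the 0.4 in the question is the similarity; NLTK's jaccard_distance correctly reports 1 - 0.4 = 0.6 for the same pair of words.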