python levenshtein-distance

Looking for a Python library that can perform Levenshtein/other edit distance at the word level


I've seen a bunch of similar questions on SO/elsewhere but none of the answers quite satisfy my needs, so I don't think this is a dup.

Also, I totally know how to implement this myself, but I'm trying not to have to re-invent the wheel.

Does anyone know of any Python packages that can compute Levenshtein or another edit distance between two lists of words (I've found a few that do) and that also let you specify your own costs for insertion, deletion, substitution, and transposition?

Basically, I want the distance to count edits to whole words in the sentences, not the number of characters by which the sentences differ.

I'm trying to replace a custom Python extension module that is written in C against Python 2's C API. I could rewrite it in either pure Python or Cython, but I'd rather simply add a dependency to the project. The only problem is that the existing code lets you specify your own costs for the various edit operations, and I haven't found a package that allows this so far.


Solution

  • The Python distancia package can compute a word-level Levenshtein distance:

    from distancia import Levenshtein
    s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'
    s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'
    distance = Levenshtein().levenshtein_distance_words(s1, s2)
    print(distance)
    

    output:

    2.0
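
The snippet above doesn't show custom costs. If those are the sticking point, a pure-Python fallback is fairly short. Here's a minimal sketch, assuming the optimal-string-alignment variant of Damerau-Levenshtein (adjacent transpositions only). The function name `word_edit_distance` and the keyword arguments `ins_cost`, `del_cost`, `sub_cost`, and `trans_cost` are made up for illustration; none of this is taken from distancia or from the original C extension.

```python
def word_edit_distance(a, b, ins_cost=1.0, del_cost=1.0,
                       sub_cost=1.0, trans_cost=1.0):
    """Weighted edit distance between two word sequences (hypothetical helper).

    Uses the optimal-string-alignment recurrence: insert, delete,
    substitute, and swap of two adjacent words.
    """
    # Accept plain strings (split on whitespace) or pre-tokenized lists.
    a = a.split() if isinstance(a, str) else list(a)
    b = b.split() if isinstance(b, str) else list(b)
    n, m = len(a), len(b)

    # d[i][j] = cheapest cost of turning a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            best = min(
                d[i - 1][j] + del_cost,                         # drop a word from a
                d[i][j - 1] + ins_cost,                         # add a word from b
                d[i - 1][j - 1] + (0.0 if same else sub_cost),  # keep or replace
            )
            # Two adjacent words appear in swapped order.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + trans_cost)
            d[i][j] = best
    return d[n][m]


s1 = 'WAKA WAKA QB WTF BBBQ WAKA LOREM IPSUM WAKA'
s2 = 'WAKA OMFG QB WTF WAKA WAKA LOREM IPSUM WAKA'
print(word_edit_distance(s1, s2))                  # 2.0
print(word_edit_distance(s1, s2, sub_cost=0.5))    # 1.0
```

With unit costs this agrees with the 2.0 above: two word substitutions. It runs in O(n*m) time and memory, so if it replaces a C extension on long inputs it may be worth porting the same loop to Cython.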