python string unicode distance jaro-winkler

How to compare the similarity of two non-English strings in Python


I want to find the similarity between two strings. For example:

string1 = "One"
string2 = "one"

I expect the answer to be a score between 0 and 1; for the two strings above, it should be 1. Right now I'm using Jellyfish, a Python module that provides the jaro_distance() function. The downside is that I can only compare strings that contain English words and special characters. I want to compare strings in other languages, for example Punjabi:

string1 = "ਬੁੱਧਵਾਰ"
string2 = "ਬੁੱਧਵਾ"

I tried the same jaro_distance() function, but I get:

>>> score = jellyfish.jaro_distance(unicode(string1), unicode(string2))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

I also tried encoding and decoding the strings before passing them to the function. Is there a way to use jaro_distance() with other languages, or is there another module or function that can do this?


Solution

  • You can use SequenceMatcher from the standard-library module difflib

    Code example:

    import difflib
    
    print(difflib.SequenceMatcher(None, "ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ").ratio())
    

    Output:

    0.9230769230769231
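
If you also want "One" and "one" to score 1, as in the question, one option is to normalize both strings before comparing them. The following is only a sketch for Python 3: `similarity` is a made-up helper name, UTF-8 is an assumed encoding for any byte strings, and the score is difflib's ratio, not a Jaro distance. The explicit decode matters because decoding UTF-8 bytes with the ASCII codec is what produces the UnicodeDecodeError shown in the question.

```python
import difflib
import unicodedata


def similarity(a, b, encoding="utf-8"):
    """Return a score between 0 and 1 (hypothetical helper, not part of difflib)."""

    def prepare(s):
        # Decode raw bytes explicitly; the UTF-8 default is an assumption.
        if isinstance(s, bytes):
            s = s.decode(encoding)
        # NFC makes canonically equivalent code point sequences identical,
        # and casefold() removes case differences.
        return unicodedata.normalize("NFC", s).casefold()

    return difflib.SequenceMatcher(None, prepare(a), prepare(b)).ratio()


print(similarity("One", "one"))  # 1.0
print(similarity("ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ"))
print(similarity("ਬੁੱਧਵਾਰ".encode("utf-8"), "ਬੁੱਧਵਾ"))
```

Whether case folding and normalization are appropriate depends on your data; drop either step if case or exact code points should count as a difference.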