python string jaro-winkler pdist

String Distance Matrix in Python using pdist


How to calculate Jaro Winkler distance matrix of strings in Python?

I have a large array of hand-entered strings (names and record numbers), and I'm trying to find duplicates in the list, including near-duplicates with slight variations in spelling. A response to a similar question suggested using SciPy's pdist function with a custom distance function. I've tried to implement this with the jaro_winkler function from the Levenshtein package. The problem is that jaro_winkler requires string inputs, whereas pdist seems to require a 2D array input.

Example:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler

fname = np.array(['Bob','Carl','Kristen','Calr', 'Doug']).reshape(-1,1)
dm = pdist(fname, jaro_winkler)
dm = squareform(dm)

Expected Output - Something like this:

          Bob  Carl   Kristen  Calr  Doug
Bob       1.0   -        -       -     -
Carl      0.0   1.0      -       -     -
Kristen   0.0   0.46    1.0      -     -
Calr      0.0   0.93    0.46    1.0    -
Doug      0.53  0.0     0.0     0.0   1.0

Actual Error:

jaro_winkler expected two Strings or two Unicodes

I'm assuming this is because the jaro_winkler function is seeing an ndarray instead of a string, and I'm not sure how to convert the function input to a string in the context of the pdist function.

Does anyone have a suggestion to allow this to work? Thanks in advance!


Solution

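  • You need to wrap the distance function. pdist passes each row of the 2D array to the metric as a one-element array, so the wrapper has to pull the string out of each row (x[0], y[0]) before calling the string function. The following example demonstrates this with the Levenshtein distance: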

    import numpy as np    
    from Levenshtein import distance
    from scipy.spatial.distance import pdist, squareform
    
    # my list of strings
    strings = ["hello","hallo","choco"]
    
    # prepare 2 dimensional array M x N (M entries (3) with N dimensions (1)) 
    transformed_strings = np.array(strings).reshape(-1,1)
    
    # calculate condensed distance matrix by wrapping the Levenshtein distance function
    distance_matrix = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))
    
    # get square matrix
    print(squareform(distance_matrix))
    
    Output:
    array([[ 0.,  1.,  4.],
           [ 1.,  0.,  4.],
           [ 4.,  4.,  0.]])
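
The same wrapper idea should carry over to a similarity function such as jaro_winkler, with one catch: pdist and squareform assume a distance, so the diagonal comes back as 0, whereas the expected output in the question has 1.0 on the diagonal. Below is a minimal sketch that converts similarity to distance as 1 - similarity and converts back afterwards. To keep it runnable without extra packages, difflib.SequenceMatcher from the standard library stands in for jaro_winkler. That swap, the helper names similarity and string_distance, and the 0.7 threshold are my assumptions rather than part of the original answer, and the scores will differ from Jaro-Winkler values.

```python
import difflib

import numpy as np
from scipy.spatial.distance import pdist, squareform


def similarity(a, b):
    # Stand-in similarity in [0, 1] from the standard library (assumption);
    # a Jaro-Winkler function could be called here instead.
    return difflib.SequenceMatcher(None, a, b).ratio()


def string_distance(x, y):
    # pdist hands over each row as a one-element array, so unwrap with [0].
    # 1 - similarity turns the score into a distance (identical strings -> 0).
    return 1.0 - similarity(str(x[0]), str(y[0]))


names = ["Bob", "Carl", "Kristen", "Calr", "Doug"]

# One string per row: shape (5, 1), which is the 2D input pdist expects.
column = np.array(names).reshape(-1, 1)

# Condensed distances -> square matrix (zeros on the diagonal).
distance_matrix = squareform(pdist(column, string_distance))

# Back to similarities, so the diagonal is 1.0 as in the question's table.
similarity_matrix = 1.0 - distance_matrix

# Flag likely duplicates above a hypothetical threshold.
threshold = 0.7
likely_duplicates = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if similarity_matrix[i, j] >= threshold
]
print(likely_duplicates)
```

If the Levenshtein package is installed, replacing the body of similarity with a call to jaro_winkler(a, b) should give Jaro-Winkler scores instead; the threshold would then need retuning for that scale.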