python nlp levenshtein-distance data-preprocessing misspelling

Correct typos inside a column using word distance


I have a column in a pandas DataFrame containing a bunch of names:

NAME
-------
robert
robert
robrt
marie
ann

I'd like to merge similar ones in order to correct typos and make the values uniform, resulting in:

NAME
-------
robert
robert
robert
marie
ann

I would like to use Levenshtein distance in order to search for similar records. Also, solutions using other metrics are much appreciated.

Thanks a lot in advance

All the examples on Stack Overflow seem to compare multiple columns, so I have not been able to find a nice solution to my problem.


Solution

  • One possible approach is the following:

    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering
    from Levenshtein import distance
    import numpy as np
    
    df = pd.DataFrame({'NAME': ['robert', 'robert', 'robrt', 'marie', 'ann']})
    
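    def merge_similar_names(df, column):
        # Normalize first so the merge keys below match the clustered names.
        df = df.copy()
        df[column] = df[column].str.lower().str.strip()
        unique_names = df[column].unique()
        # Pairwise Levenshtein distances between the unique names (symmetric matrix).
        distances = np.zeros((len(unique_names), len(unique_names)))
        for i in range(len(unique_names)):
            for j in range(i + 1, len(unique_names)):
                d = distance(unique_names[i], unique_names[j])
                distances[i, j] = d
                distances[j, i] = d
        # Clusters are merged only while their linkage distance stays below distance_threshold.
        # scikit-learn >= 1.2 takes metric='precomputed'; older releases used affinity='precomputed'.
        clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=2, linkage='complete', metric='precomputed')
        clusters = clusterer.fit_predict(distances)
        name_clusters = pd.DataFrame({column: unique_names, 'CLUSTER': clusters})
        df = pd.merge(df, name_clusters, on=column)
        # Use the most frequent spelling in each cluster as the corrected name.
        most_common_names = df.groupby('CLUSTER')[column].apply(lambda x: x.value_counts().index[0]).reset_index()
        # The second merge suffixes the original column with _x and the corrected one with _y.
        df = pd.merge(df, most_common_names, on='CLUSTER')
        df.rename(columns={column+'_y': column}, inplace=True)
        return df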
    
    df = merge_similar_names(df, 'NAME')
    
    print(df)
    
    

    which will give you

       NAME_x  CLUSTER    NAME
    0  robert        0  robert
    1  robert        0  robert
    2   robrt        0  robert
    3   marie        2   marie
    4     ann        1     ann
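
Since the question also asks about other metrics, here is a minimal sketch that uses the similarity ratio from the standard library's `difflib` instead of Levenshtein distance. The function name `build_typo_mapping` and the `cutoff=0.8` value are assumptions made for illustration, not part of the approach above; the cutoff in particular would need tuning on real data. It visits spellings from most to least frequent, so a common spelling becomes the canonical form and rarer look-alikes are mapped onto it.

```python
from collections import Counter
from difflib import get_close_matches


def build_typo_mapping(names, cutoff=0.8):
    """Map each distinct name to the most frequent spelling that resembles it.

    `cutoff` is the minimum difflib similarity ratio (0..1) for two
    spellings to be treated as the same name; 0.8 is an illustrative guess.
    """
    cleaned = [name.lower().strip() for name in names]
    counts = Counter(cleaned)
    canonical = []  # spellings accepted as correct so far
    mapping = {}
    # most_common() yields the frequent spellings first, so they win.
    for name, _ in counts.most_common():
        match = get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping


names = ['robert', 'robert', 'robrt', 'marie', 'ann']
mapping = build_typo_mapping(names)
print(mapping)
# {'robert': 'robert', 'robrt': 'robert', 'marie': 'marie', 'ann': 'ann'}
```

With a DataFrame, the mapping could then be applied with something like `df['NAME'].str.lower().str.strip().map(mapping)`. Unlike the clustering approach, this is a greedy pass, so the result depends on the frequency order of the spellings.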