pythonpandasapplysequencematcher

Finding all similar values in pandas using SequenceMatcher Python


I'm trying to filter on a specific value in pandas in a column but also allow for typing mistakes. I thought using SequenceMatcher was a good solution but I don't know what the best way is to apply it within a DataFrame. Let's say the headers are 'number' and 'location'.

df1 = [[1, Amsterdam], [2, amsterdam], [3, rotterdam], [4, amstrdam], [5, Berlin]]

If I want to filter on 'amsterdam' with a certain ratio, let's say 0.6. The output probably would be like this.

df1 = [[1, Amsterdam], [2, amsterdam], [4, amstrdam]]

What would be the best way to get this done? I was thinking about using an filter option but that didn't work. Do I need to first run an apply function to add a column with the ratio and then be able to filter on it? Or is there a smarter way?

df2 = df1[SequenceMatcher(None, location, df1.location).ratio() > 0.6]


Solution

  • You were on the right track, using apply and loc you can filter the df. I've put the ratio to 0.7 otherwise rotterdam would also be matched.

    import difflib
    import pandas as pd
    
    df = pd.DataFrame([[1, 'Amsterdam'], [2, 'amsterdam'], [3, 'rotterdam'], [4, 'amstrdam'], [5, 'Berlin']])
    df.columns = ['number', 'location']
    
    df = df.loc[df.apply(lambda x: difflib.SequenceMatcher(None, 'Amsterdam', x.location).ratio() > 0.7, axis=1)]
    
    print(df)
    
       number   location
    0       1  Amsterdam
    1       2  amsterdam
    3       4   amstrdam