[SOLVED] Finding all similar values in pandas using SequenceMatcher Python

Finding all similar values in pandas using SequenceMatcher Python

I'm trying to filter on a specific value in pandas in a column but also allow for typing mistakes. I thought using SequenceMatcher was a good solution but I don't know what the best way is to apply it within a DataFrame. Let's say the headers are 'number' and 'location'.

df1 = [[1, Amsterdam], [2, amsterdam], [3, rotterdam], [4, amstrdam], [5, Berlin]]

If I want to filter on 'amsterdam' with a certain ratio, let's say 0.6. The output probably would be like this.

df1 = [[1, Amsterdam], [2, amsterdam], [4, amstrdam]]

What would be the best way to get this done? I was thinking about using an filter option but that didn't work. Do I need to first run an apply function to add a column with the ratio and then be able to filter on it? Or is there a smarter way?

df2 = df1[SequenceMatcher(None, location, df1.location).ratio() > 0.6]

Solution

You were on the right track, using apply and loc you can filter the df. I've put the ratio to 0.7 otherwise rotterdam would also be matched.

import difflib
import pandas as pd

df = pd.DataFrame([[1, 'Amsterdam'], [2, 'amsterdam'], [3, 'rotterdam'], [4, 'amstrdam'], [5, 'Berlin']])
df.columns = ['number', 'location']

df = df.loc[df.apply(lambda x: difflib.SequenceMatcher(None, 'Amsterdam', x.location).ratio() > 0.7, axis=1)]

print(df)

   number   location
0       1  Amsterdam
1       2  amsterdam
3       4   amstrdam