python pandas jupyter-notebook fuzzywuzzy entityresolver

Fastest way to fuzzy match strings between two pandas data frames


I have two data frames, each with a column of names:

df1[name]   -> 3,000 rows

df2[name]   -> 64,000 rows

I am using fuzzywuzzy to get the best match in df2 for each entry in df1, using the following code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

matches = [process.extract(x, df2['name'], limit=1) for x in df1['name']]

But this is taking forever to finish. Is there a faster way to do fuzzy string matching in pandas?


Solution

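  • One change you can make is to use a generator expression: replace the square brackets with round brackets. This avoids building the whole result list in memory and yields each match lazily as you iterate over it. Note that it defers the work rather than reducing it, so the total matching time stays roughly the same.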

    matches = (process.extract(x, df2['name'], limit=1) for x in df1['name'])
    

    Edit: One more suggestion: you can parallelize the operation with the multiprocessing library.
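
    A minimal sketch of the multiprocessing idea, under some assumptions: the names have already been pulled out into plain Python lists (for example with df1['name'].tolist()), and the scorer here is difflib.SequenceMatcher from the standard library, used as a stand-in so the example runs without extra packages. The helper names similarity, best_match and match_all are made up for illustration; they are not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale, case-insensitive.

    Assumption: fuzzywuzzy's fuzz.ratio could be used here instead.
    """
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(query, choices):
    """Return (query, best_choice, score) for a single query string."""
    best = max(choices, key=lambda c: similarity(query, c))
    return query, best, similarity(query, best)


def match_all(queries, choices, workers=4, chunksize=50):
    """Match every query against choices, splitting the queries across processes."""
    worker = partial(best_match, choices=choices)
    if workers <= 1:
        # Serial fallback, handy for debugging or very small inputs.
        return [worker(q) for q in queries]
    with Pool(processes=workers) as pool:
        # Each worker process handles batches of `chunksize` queries;
        # results come back in the same order as `queries`.
        return pool.map(worker, queries, chunksize=chunksize)
```

    You would call it as match_all(df1['name'].tolist(), df2['name'].tolist()), inside an if __name__ == "__main__": guard so that worker processes can import the script safely. Parallelizing only divides the work across cores: the number of comparisons (3,000 x 64,000) is unchanged, so the best case is a speedup of roughly the number of workers.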