I have two data frames, each with a column of names:
df1[name] -> 3000 rows
df2[name] -> 64000 rows
I am using fuzzywuzzy to get the best match for each df1 entry from df2, using the following code:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
matches = [process.extract(x, df1, limit=1) for x in df2]
But this is taking forever to finish. Is there a faster way to do fuzzy string matching in pandas?
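One change you can make is to use a generator expression: replace the square brackets with round brackets. The line then returns immediately and uses far less memory, because each match is computed only when you iterate over the result. Note that the total matching work is the same once you consume every result, so this helps most when you process matches one at a time or stop early.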
matches = (process.extract(x, df1, limit=1) for x in df2)
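To see what the round brackets change, here is a minimal sketch. It uses the standard library's difflib as a stand-in scorer so it runs without fuzzywuzzy installed; the helper best_match, the call counter, and the sample names are all hypothetical and not from the question.

```python
import difflib

call_count = [0]  # mutable counter so we can observe when work happens


def best_match(query, choices):
    """Return the closest string in choices (difflib stand-in for fuzzywuzzy)."""
    call_count[0] += 1
    hits = difflib.get_close_matches(query, choices, n=1, cutoff=0.0)
    return hits[0] if hits else None


queries = ["Jon Smith", "Alise Brown"]                # plays the role of df1
choices = ["John Smith", "Alice Brown", "Bob Stone"]  # plays the role of df2

# Round brackets: nothing is computed when this line runs.
lazy_matches = (best_match(q, choices) for q in queries)
calls_before = call_count[0]

# Each next() (or loop iteration) computes exactly one match.
first = next(lazy_matches)
calls_after_first = call_count[0]

# Consuming the rest does the remaining work.
rest = list(lazy_matches)
```

The generator is created instantly, and each match is paid for at the moment it is requested.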
Edit: One more suggestion: the operation can be parallelized with the multiprocessing library.
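A rough sketch of the multiprocessing idea, under some assumptions: difflib again stands in for fuzzywuzzy so the example is self-contained, and the names best_match, parallel_best_matches, workers, and chunksize=100 are illustrative choices, not anything from the question. With fuzzywuzzy installed, the body of best_match could call process.extractOne(query, choices) instead.

```python
import difflib
from multiprocessing import Pool

_choices = []  # set once per worker process by _init_worker


def _init_worker(choices):
    """Store the choices list in each worker so it is not re-sent per task."""
    global _choices
    _choices = choices


def best_match(query, choices=None):
    """Return (query, closest choice), or (query, None) if choices is empty."""
    candidates = _choices if choices is None else choices
    hits = difflib.get_close_matches(query, candidates, n=1, cutoff=0.0)
    return query, (hits[0] if hits else None)


def parallel_best_matches(queries, choices, workers=4):
    """Split the queries across worker processes and collect results in order."""
    with Pool(processes=workers,
              initializer=_init_worker,
              initargs=(list(choices),)) as pool:
        # chunksize batches queries to reduce inter-process overhead
        return pool.map(best_match, list(queries), chunksize=100)
```

It would be called as parallel_best_matches(df1["name"], df2["name"]), placed under an if __name__ == "__main__": guard so that worker processes do not re-run the script. Since every query is matched independently, the speedup should scale roughly with the number of cores, minus process start-up cost.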