I have a dataset of 100,000 records. My problem is a many-to-many match: I need to calculate the fuzzy score of the Name column of each row against all 100k rows. I am using a for loop to iterate over the rows and calculating the fuzz score with the pandas apply method. The real problem is time: the code takes around 15 hours, so I tried parallel processing and multiprocessing to cut that down, but failed to get either working.
The dataframe looks like the example below:
id  Name
1   Alpha
2   Beta
3   Gamma
4   Theta
5   Lambda
..  ..
and so on to 100k records
What I am expecting is to create a dataframe that holds only the pairs with a fuzz score above 75.
Expected output:
id_1  Name_1  id_2  Name_2  Score
1     Alpha   39    Alph    88
3     Gamma   78    Gamme   80
4     Theta   56    heta    88
I can't use pd.merge for a cross join and then calculate the score with the apply method, as that approach needs a lot of RAM.
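For context, here is a minimal sketch of the slow per-row approach described above (column names are taken from the examples; the choice of token_sort_ratio as scorer and a default RangeIndex on df are assumptions):

import pandas as pd
from rapidfuzz import fuzz

# Naive O(n^2) approach: score every name against every other name.
# At 100k rows that is ~10 billion comparisons, hence the ~15 hour runtime.
matches = []
for i, name_1 in enumerate(df['Name']):
    # one pandas apply per row, as described above (scorer is an assumption)
    scores = df['Name'].apply(lambda name_2: fuzz.token_sort_ratio(name_1, name_2))
    for j in scores[scores > 75].index:
        if i != j:  # skip self-matches, which always score 100
            matches.append((df['id'].iloc[i], name_1,
                            df['id'].iloc[j], df['Name'].iloc[j],
                            scores[j]))

result = pd.DataFrame(matches, columns=['id_1', 'Name_1', 'id_2', 'Name_2', 'Score'])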
Thank you, but I found a solution myself. I created batches of the dataframe and processed them with the built-in multithreading option in rapidfuzz. For each batch I offset the row indices by the combined length of all previous batches, so I can easily map each result back to its original index.
Below is the solution:
from rapidfuzz.process import cdist
import rapidfuzz
import numpy as np

val = 0          # start position of the current batch
df_shape = 0     # total rows processed so far, used to restore original indices
index_vals = {}  # original row index -> list of matching row indices
batch_len = 10000

for i in range(batch_len, df.shape[0] + batch_len + 1, batch_len):
    if df[val:i].shape[0] != 0:
        # Score the current batch of names against the full Name column;
        # workers=-1 enables rapidfuzz's built-in multithreading.
        ind_score = cdist(df['Name'][val:i],
                          df['Name'],
                          scorer=rapidfuzz.fuzz.token_sort_ratio,
                          workers=-1)
        # Offset j by the rows already processed so keys are original indices.
        for j in range(len(ind_score)):
            index_vals[j + df_shape] = list(np.where(ind_score[j] > 85)[0])
        df_shape += df[val:i].shape[0]
        val = i
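index_vals only stores the matching row indices, so to build the expected pairs dataframe one could expand it afterwards, for example (a sketch assuming df has a default RangeIndex; scores are recomputed per surviving pair since the cdist matrices were not kept):

import pandas as pd
from rapidfuzz import fuzz

# Expand {row_index: [matching row indices]} into the expected output.
pairs = []
for i, match_indices in index_vals.items():
    for j in match_indices:
        if i != j:  # a row always matches itself with score 100
            score = fuzz.token_sort_ratio(df['Name'].iloc[i], df['Name'].iloc[j])
            pairs.append((df['id'].iloc[i], df['Name'].iloc[i],
                          df['id'].iloc[j], df['Name'].iloc[j], score))

result = pd.DataFrame(pairs, columns=['id_1', 'Name_1', 'id_2', 'Name_2', 'Score'])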
This took me only 1.5 hours instead of 17 hours for 150k rows.
Thank you.