python, parallel-processing, multiprocessing, fuzzywuzzy, fuzzy

Parallel processing with rapidfuzz function


I have a dataset of 100,000 records. My problem is many-to-many: I need to calculate a fuzzy score between the Name column of each row and all 100k rows. I am using a for loop to iterate over the rows and calculating the fuzz score with pandas' apply method. The real problem is time: the code takes around 15 hours, so I tried parallel processing and multiprocessing to cut that down, but failed to get them working.

Dataframe looks like the below example:

id   Name
1    Alpha
2    Beta
3    Gamma
4    Theta
5    Lambda
.      .
.      .
.      .
and so on to 100k records

What I am expecting is a dataframe that holds the pairs whose fuzz score is above 75.

Expected output:

id_1   Name_1   id_2   Name_2   Score
1      Alpha    39     Alph     88
3      Gamma    78     Gamme    80
4      Theta    56     heta     88

I can't use pd.merge for a cross join and then calculate the score with apply, as that approach needs far too much RAM.
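A quick back-of-envelope calculation shows why a materialised cross join is off the table (the bytes-per-pair figure is an assumption, purely for illustration):

```python
# Rough arithmetic for the memory cost of a full cross join
n = 100_000                      # rows in the frame
pairs = n * n                    # candidate pairs after a cross join
bytes_per_pair = 50              # assumed: two short strings plus two ids
approx_gb = pairs * bytes_per_pair / 1e9
print(f"{pairs:,} pairs ≈ {approx_gb:,.0f} GB")
```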


Solution

  • Thank you, but I found a solution. I created batches of the dataframe and processed them with the built-in multithreading option in rapidfuzz. For each batch I offset its indices by the total length of the previous batches, if any, so I can easily recover the original index.

    Below is the solution:

    from rapidfuzz.process import cdist
    import rapidfuzz
    import numpy as np

    val = 0          # start row of the current batch
    df_shape = 0     # rows processed so far (offset back to original indices)
    index_vals = {}  # original row index -> list of matching row indices
    batch_len = 10000
    for i in range(batch_len, df.shape[0] + batch_len + 1, batch_len):
        batch = df['Name'].iloc[val:i]   # iloc: half-open slice, no overlap between batches
        if batch.shape[0] != 0:
            # score one batch against all names; workers=-1 uses every core
            # via rapidfuzz's built-in multithreading
            ind_score = cdist(batch,
                              df['Name'],
                              scorer=rapidfuzz.fuzz.token_sort_ratio,
                              workers=-1)
            for j in range(len(ind_score)):
                # keep only the rows scoring above the threshold (85 here)
                index_vals[j + df_shape] = list(np.where(ind_score[j] > 85)[0])
        df_shape += batch.shape[0]
        val = i
    

    This took me only 1.5 hours instead of 17 hours for 150k rows.

    Thank you.