Tags: python, python-3.x, fuzzywuzzy, rapidfuzz

Optimizing RapidFuzz for a list with a large number of elements (e.g. 200,000)


I would like to run the RapidFuzz code from this post on a list with 200,000 elements. What is the best way to optimize it for a faster run, for example on a GPU?

Find fuzzy match string in a list with matching string value and their count

import pandas as pd
from rapidfuzz import fuzz

elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']

# [name, list of duplicates, duplicate count] for each element
results = [[name, [], 0] for name in elements]

# Compare each unordered pair exactly once; fuzz.ratio returns 0 when the
# score falls below score_cutoff, so the truth test filters non-matches.
for i, element in enumerate(elements):
    for j, choice in enumerate(elements[i+1:]):
        if fuzz.ratio(element, choice, score_cutoff=90):
            # Record the match symmetrically for both elements
            results[i][2] += 1
            results[i][1].append(choice)
            results[j+i+1][2] += 1
            results[j+i+1][1].append(element)

data = pd.DataFrame(results, columns=['name', 'duplicates', 'duplicate_count'])


Expected Output -

      name        duplicates  duplicate_count
0   vikash           [vikas]                1
1    vikas  [vikash, vikahs]                2
2    Vinod          [Vinodh]                1
3    Vikky                []                0
4    Akash           [Akash]                1
5   Vinodh           [Vinod]                1
6   Sachin                []                0
7   Salman                []                0
8     Ajay                []                0
9   Suchin                []                0
10   Akash           [Akash]                1
11  vikahs           [vikas]                1




Solution

  • The rapidfuzz library provides a function, process.cdist, that computes all pairwise scores in a single call and uses the parallel processing power of the CPU to speed up the process.

    The workers argument enables parallel processing; with workers=-1, all available cores are used.

    import numpy as np
    import pandas as pd
    from rapidfuzz.process import cdist
    
    # Compute all pairwise scores in one call; scores below the
    # cutoff are returned as 0
    sa = cdist(elements, elements, score_cutoff=90, workers=-1)
    
    duplicates_list = []
    
    for distances in sa:
        # Indices of near-duplicates: exclude exact matches (100)
        # and non-matches (0, i.e. below the cutoff)
        indices = np.argwhere(~np.isin(distances, [100, 0])).flatten()
        # Map indices back to names
        names = list(map(elements.__getitem__, indices))
        duplicates_list.append(names)
    
    # Create dataframe using the data
    df = pd.DataFrame({'name': elements, 'duplicates': duplicates_list})
    df['duplicate_count'] = df.duplicates.str.len()
    

    Output

          name        duplicates  duplicate_count
    0   vikash           [vikas]                1
    1    vikas  [vikash, vikahs]                2
    2    Vinod          [Vinodh]                1
    3    Vikky                []                0
    4    Akash                []                0
    5   Vinodh           [Vinod]                1
    6   Sachin                []                0
    7   Salman                []                0
    8     Ajay                []                0
    9   Suchin                []                0
    10   Akash                []                0
    11  vikahs           [vikas]                1
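
    Note that this output differs from the expected output for the two
    Akash entries: their mutual score is exactly 100, and the np.isin
    filter drops every score of 100 (not just the self-comparison), so
    identical strings are not reported as duplicates of each other.

    One more caveat for the 200,000-element case: cdist materializes the
    full N x N score matrix, which at that size is 4 * 10^10 entries and
    will not fit in memory on a typical machine. (As far as I know,
    RapidFuzz runs on the CPU only; workers=-1 parallelizes across cores
    rather than offloading to a GPU.) A common workaround is to score the
    queries in row blocks and keep only the matches from each block.
    Below is a minimal sketch of that idea; the helper name
    find_duplicates_chunked and the block size of 1,000 are my own
    choices, and it uses cdist's dtype argument with np.uint8 (enough,
    since fuzz.ratio scores are 0-100). Unlike the np.isin filter above,
    it skips only the self-comparison, so identical strings like the two
    Akash entries are reported as duplicates, matching the expected
    output.

    import numpy as np
    import pandas as pd
    from rapidfuzz.process import cdist

    def find_duplicates_chunked(elements, score_cutoff=90, chunk_size=1000):
        # Score chunk_size queries at a time against all elements, so peak
        # memory is about chunk_size * len(elements) bytes, not N * N
        duplicates_list = []
        for start in range(0, len(elements), chunk_size):
            chunk = elements[start:start + chunk_size]
            # uint8 is enough because fuzz.ratio scores are 0..100
            scores = cdist(chunk, elements, score_cutoff=score_cutoff,
                           dtype=np.uint8, workers=-1)
            for offset, distances in enumerate(scores):
                i = start + offset
                # Keep scores at or above the cutoff, skipping only the
                # self-comparison on the diagonal
                indices = np.flatnonzero(distances >= score_cutoff)
                duplicates_list.append(
                    [elements[j] for j in indices if j != i])
        return duplicates_list

    duplicates = find_duplicates_chunked(elements)
    df = pd.DataFrame({'name': elements, 'duplicates': duplicates})
    df['duplicate_count'] = df.duplicates.str.len()

    This trades one big allocation for N / chunk_size smaller ones;
    runtime is still O(N^2) comparisons, so for 200,000 elements the
    scoring itself will dominate.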