pythonpandasrapidfuzz

Is there a way to speed up matching addresses and level of confidence per match between two data frames for large datasets?


I have got a script below that check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match.

I am using rapid fuzz I heard it is faster than fuzzywuzzy. However it is still taking a very long time to do the match and calculations. Here is the CSV files. main_dataset.csv contains about 3 million records, and reference_dataset.csv contains about 10 records.

Below is the time it took for each record.

start time: Thu Oct  6 10:51:18 2022
end time: Thu Oct  6 10:51:23 2022
start time: Thu Oct  6 10:51:23 2022
end time: Thu Oct  6 10:51:28 2022
start time: Thu Oct  6 10:51:28 2022
end time: Thu Oct  6 10:51:32 2022
start time: Thu Oct  6 10:51:32 2022
end time: Thu Oct  6 10:51:36 2022
start time: Thu Oct  6 10:51:36 2022
end time: Thu Oct  6 10:51:41 2022
start time: Thu Oct  6 10:51:41 2022
end time: Thu Oct  6 10:51:45 2022
start time: Thu Oct  6 10:51:45 2022
end time: Thu Oct  6 10:51:50 2022
start time: Thu Oct  6 10:51:50 2022
end time: Thu Oct  6 10:51:54 2022
start time: Thu Oct  6 10:51:54 2022
end time: Thu Oct  6 10:51:59 2022

My script is here:

import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    o = process.extract(x, s, score_cutoff = 60)
    if o != None:
        return o[1]

def match_addresses(add, contacts_addresses, min_score=0):
    response = process.extract(add, contacts_addresses, scorer=fuzz.token_sort_ratio)
    return response


def get_highest_score(scores):
    total_scores = []
    for val in scores:
        total_scores.append(val[1])
    max_value = max(total_scores)
    max_index = total_scores.index(max_value)
    return scores[max_index]


scores_list = []
names = []
for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    match = get_highest_score(scores)
    name = (str(x), str(match[0]))
    names.append(name)
    score = int(match[1])
    scores_list.append(score)
    # end = time.time()
    # print("end time:", time.ctime(end))
name_dict = dict(names)

match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)

merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')


Solution

  • It is recommended to use process.cdist which compares two sequences and obtains a similarity matrix instead of process.extract/process.extractOne right now, since a lot of the newer performance improvements only got added to this algorithm so far.

    Namely those improvements are:

    1. support for multithreading using the workers argument
    2. support to compare multiple short sequences (<= 64 characters) in parallel using SIMD on x64.

    Both of these improvements will be added to process.extract and process.extractOne at some point, but at this point (rapidfuzz==v2.11.1) they only exist.

    A couple relevant issues for future improvements on this front are:

    This could be e.g. implemented in the following way:

    from itertools import islice
    
    chunk_size = 100
    ref_addr_iter = iter(ref_addresses)
    while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
        scores = process.cdist(ref_addr_chunk, contacts_addresses, scorer=fuzz.token_sort_ratio, score_cutoff=75, workers=-1)
        max_scores_idx = scores.argmax(axis=1)
        for ref_addr_idx, score_idx in enumerate(max_scores_idx):
            names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
            scores_list.append(scores[ref_addr_idx,score_idx])