The script below checks a column of addresses in my dataframe against a column of addresses in another dataframe, to see whether they match and how well they match.
I am using RapidFuzz, which I heard is faster than FuzzyWuzzy. However, it is still taking a very long time to do the matching and scoring. Here are the CSV files: main_dataset.csv contains about 3 million records, and reference_dataset.csv contains about 10 records.
Below are the timings for each record; every reference address takes roughly 4-5 seconds to score against the ~3 million candidate addresses.
start time: Thu Oct 6 10:51:18 2022
end time: Thu Oct 6 10:51:23 2022
start time: Thu Oct 6 10:51:23 2022
end time: Thu Oct 6 10:51:28 2022
start time: Thu Oct 6 10:51:28 2022
end time: Thu Oct 6 10:51:32 2022
start time: Thu Oct 6 10:51:32 2022
end time: Thu Oct 6 10:51:36 2022
start time: Thu Oct 6 10:51:36 2022
end time: Thu Oct 6 10:51:41 2022
start time: Thu Oct 6 10:51:41 2022
end time: Thu Oct 6 10:51:45 2022
start time: Thu Oct 6 10:51:45 2022
end time: Thu Oct 6 10:51:50 2022
start time: Thu Oct 6 10:51:50 2022
end time: Thu Oct 6 10:51:54 2022
start time: Thu Oct 6 10:51:54 2022
end time: Thu Oct 6 10:51:59 2022
My script is here:
import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)  # materialises all ~3M addresses in memory
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    # (Unused helper) return the best match above a 60% cutoff, if any.
    o = process.extract(x, s, score_cutoff=60)
    if o:
        return o[0]

def match_addresses(add, contacts_addresses, min_score=0):
    # Score one reference address against every contact address.
    return process.extract(add, contacts_addresses,
                           scorer=fuzz.token_sort_ratio,
                           score_cutoff=min_score)

def get_highest_score(scores):
    # Return the (match, score, index) tuple with the highest score.
    return max(scores, key=lambda s: s[1])

scores_list = []
names = []
for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    if not scores:
        continue  # nothing scored above the cutoff
    match = get_highest_score(scores)
    names.append((str(x), str(match[0])))
    scores_list.append(int(match[1]))
    # end = time.time()
    # print("end time:", time.ctime(end))

match_df = pd.DataFrame(names, columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)
merged_results_01 = pd.concat([match_df, scores_df], axis=1)
merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')
It is recommended to use process.cdist, which compares two sequences of strings and produces a similarity matrix, instead of process.extract/process.extractOne right now, since a lot of the newer performance improvements have so far only been added to it. Most notably, that includes multithreading via the workers argument. These improvements will be added to process.extract and process.extractOne at some point, but as of rapidfuzz==v2.11.1 they only exist in process.cdist.
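For illustration, here is a minimal, self-contained sketch of what process.cdist returns (the address strings are made-up examples):

from rapidfuzz import process, fuzz

queries = ["12 Main St", "99 Oak Ave"]
choices = ["12 Main Street", "99 Oak Avenue", "7 Elm Rd"]

# cdist scores every query against every choice and returns a
# len(queries) x len(choices) numpy array, parallelised with workers=-1.
matrix = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio, workers=-1)
print(matrix.shape)        # (2, 3)
print(matrix[0].argmax())  # index of the best match for the first query -> 0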
A couple of open issues track future improvements on this front.
This could be implemented, e.g., in the following way:
from itertools import islice

chunk_size = 100
names = []
scores_list = []
ref_addr_iter = iter(ref_addresses)
# Process the reference addresses in chunks so that the
# chunk_size x len(contacts_addresses) score matrix fits in memory.
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
    # Scores below score_cutoff come back as 0; workers=-1 uses all CPU cores.
    scores = process.cdist(ref_addr_chunk, contacts_addresses,
                           scorer=fuzz.token_sort_ratio,
                           score_cutoff=75, workers=-1)
    # For each reference address, keep the best-scoring contact address.
    max_scores_idx = scores.argmax(axis=1)
    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])
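From there the results can be assembled into a dataframe the same way as in your original script; a minimal sketch, assuming the ref_df and the names/scores_list built above:

import pandas as pd

match_df = pd.DataFrame(names, columns=['ref_address', 'matched_address'])
match_df['score'] = scores_list
merged = ref_df.merge(match_df, how='right', on='ref_address')
merged.to_csv('results.csv')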