I have a list of ~80k words that may contain spelling mistakes (e.g., "apple" vs "applee" vs " apple" vs " aplee ").
I'm planning to create a dataframe grid by picking two words at a time and then applying a fuzzy score function to compare their similarity. Before that, I apply standard text cleaning (trimming, removing special characters, collapsing double spaces, etc.) and take the unique list of words to check for similarity.
I'm using the itertools.combinations function to create the dataframe grid:
# Sample Python code
import itertools
import pandas as pd

my_unique_list = ['apple', 'applee', 'aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list, 2), columns=['name1', 'name2'])
    name1   name2
0   apple  applee
1   apple   aplee
2  applee   aplee
I have defined a function that calculates the fuzzy scores:
from thefuzz import fuzz  # or: from fuzzywuzzy import fuzz

def fuzzy_score_func(row):
    fuzzywuzzy_partial_ratio = fuzz.partial_ratio(row['name1'], row['name2'])
    thefuzz_ratio = fuzz.ratio(row['name1'], row['name2'])
    return fuzzywuzzy_partial_ratio, thefuzz_ratio

and I use the apply function to get the final scores:
data_grid[['partial_ratio', 'ratio']] = data_grid.apply(fuzzy_score_func, axis=1, result_type='expand')
    name1   name2  partial_ratio  ratio
0   apple  applee            100     91
1   apple   aplee             80     80
2  applee   aplee             80     91
This works fine when the list has ~8k words, where checking all combinations gives ~25Mn rows in the dataframe.
But when I expand the list to 80k words, I get a memory error in step 1, when I try to initialize the dataframe with all possible combinations. That makes sense, given that the dataframe would have ~3.2Bn rows (80,000 × 79,999 / 2):
File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\, in DataFrame.__init__(self, data, index, columns, dtype, copy)
736 data = np.asarray(data)
737 else:
--> 738 data = list(data)
739 if len(data) > 0:
740 if is_dataclass(data[0]):
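For scale, the number of rows can be checked without building anything. This is a minimal sketch using only the standard library (it needs Python 3.8+ for `math.comb`); the helper name `pair_count` is just for illustration:

```python
import math

def pair_count(n_words):
    """Number of unordered pairs that itertools.combinations(words, 2) yields."""
    return math.comb(n_words, 2)

small = pair_count(8_000)    # 31_996_000 pairs
large = pair_count(80_000)   # 3_199_960_000 pairs
```

Since pandas calls `list(data)` on the iterator (as the traceback shows), every one of those pairs has to exist in memory at once before the dataframe is even created.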
Any suggestions on how to tackle this memory issue, or is there a better way to implement this? I tried exploring multiprocessing, nested loops, etc., but without much success.
I'm using an Intel Windows laptop:
Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz 3.00 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor
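I might try starting with this code, which uses just itertools without pandas. The pairs are scored one at a time and only the acceptably close matches are written out, so the full grid never has to fit in memory.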
import csv
import itertools
import fuzzywuzzy.fuzz

## ----------------------
## the result of cleaning and filtering your input data...
## ----------------------
my_unique_list = ['apple','applee','aplee']
## ----------------------

## Minimum score for a pair to be written out (assumed value; tune it for your data)
MIN_RATIO = 80

## ----------------------
## Create a result file of acceptably close matches
## ----------------------
with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
    for index, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2)):
        if index % 1000 == 0:
            print(f"combinations processed: {index}", end="\r", flush=True)

        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
        ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            writer.writerow([word1, word2, partial_ratio, ratio])

print(f"Total combinations processed: {index+1}")
## ----------------------
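If you would rather keep working in batches (for example, to hand each batch to pandas), the same lazy iterator can be sliced into fixed-size chunks. The following is only a sketch under some assumptions: `batched_pairs`, `stand_in_ratio`, and `close_matches` are hypothetical helper names, and `difflib.SequenceMatcher` is a stand-in scorer so the example runs with the standard library alone. In practice you would swap in `fuzz.ratio` or `fuzz.partial_ratio`.

```python
import difflib
import itertools

def batched_pairs(words, batch_size):
    """Yield lists of at most batch_size word pairs, built lazily."""
    pairs = itertools.combinations(words, 2)
    while True:
        batch = list(itertools.islice(pairs, batch_size))
        if not batch:
            return
        yield batch

def stand_in_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: replace with your fuzzy scorer)."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())

def close_matches(words, min_ratio, batch_size=100_000):
    """Return (word1, word2, score) for every pair scoring at least min_ratio."""
    matches = []
    for batch in batched_pairs(words, batch_size):
        # Only one batch of pairs is held in memory at any time.
        for word1, word2 in batch:
            score = stand_in_ratio(word1, word2)
            if score >= min_ratio:
                matches.append((word1, word2, score))
    return matches

result = close_matches(["apple", "applee", "aplee", "banana"], min_ratio=80, batch_size=2)
```

Peak memory then depends on `batch_size` and on how many matches pass the threshold, not on the total number of combinations. The run time is still proportional to the number of pairs, though.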
While I'm not a multiprocessing expert, this might work. You might want to test it a bit on a smaller subset:
import csv
import functools
import itertools
import multiprocessing
import fuzzywuzzy.fuzz

## Minimum score for a pair to be kept (assumed value; tune it for your data)
MIN_RATIO = 80

def get_ratios(pair, queue):
    partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
    ratio = fuzzywuzzy.fuzz.ratio(*pair)
    if max(partial_ratio, ratio) >= MIN_RATIO:
        queue.put(list(pair) + [partial_ratio, ratio])

def main(my_unique_list):
    with multiprocessing.Manager() as manager:
        queue = manager.Queue()
        with multiprocessing.Pool(processes=8) as pool:
            ## imap_unordered is used here so the pairs are pulled lazily;
            ## draining it waits for every task before the pool is closed
            worker = functools.partial(get_ratios, queue=queue)
            for _ in pool.imap_unordered(worker, itertools.combinations(my_unique_list, 2), chunksize=1000):
                pass

        with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
            writer = csv.writer(file_out)
            writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
            while not queue.empty():
                item = queue.get()
                writer.writerow(item)

if __name__ == "__main__":
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple','applee','aplee']
    ## ----------------------
    main(my_unique_list)