Context:
I have a list of ~80k words with potential spelling mistakes
(e.g., "apple" vs "applee" vs " apple" vs " aplee ").
I'm planning to create a DataFrame grid by picking two words at a time and then applying a fuzzy-score function to compare their similarity. Before comparing, I apply standard text cleaning (trimming, removing special characters, collapsing double spaces, etc.) and then take the unique list of words to check for similarity.
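Roughly, the cleaning step looks like this (an illustrative sketch; clean_word stands in for my actual rules rather than the exact code):

import re

def clean_word(word):
    # lowercase, trim, drop special characters, collapse repeated spaces
    word = word.lower().strip()
    word = re.sub(r"[^a-z\s]", "", word)
    word = re.sub(r"\s+", " ", word)
    return word

raw_words = ['apple', 'applee', ' apple', ' aplee ']
my_unique_list = sorted({clean_word(w) for w in raw_words})
print(my_unique_list)  # ['aplee', 'apple', 'applee']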
Approach:
I'm using the itertools.combinations function to create the DataFrame grid.
# Sample Python code
# Step 1:
import itertools

import pandas as pd

my_unique_list = ['apple', 'applee', 'aplee']
data_grid = pd.DataFrame(itertools.combinations(my_unique_list, 2), columns=['name1', 'name2'])
print(data_grid)
    name1   name2
0   apple  applee
1   apple   aplee
2  applee   aplee
I have defined a function that calculates the fuzzy scores:

from fuzzywuzzy import fuzz

def fuzzy_score_func(row):
    partial_ratio = fuzz.partial_ratio(row['name1'], row['name2'])
    ratio = fuzz.ratio(row['name1'], row['name2'])
    return partial_ratio, ratio
and use the apply function to get the final scores:
# Step 2:
data_grid[['partial_ratio', 'ratio']] = data_grid.apply(fuzzy_score_func, axis=1, result_type='expand')
print(data_grid)
    name1   name2  partial_ratio  ratio
0   apple  applee            100     91
1   apple   aplee             80     80
2  applee   aplee             80     91
This works fine when the list is ~8k, where checking all combinations produces ~25Mn rows in the dataframe.
But when I try to expand the list to 80k, I get a MemoryError in Step 1, while initializing the dataframe with all possible combinations. That makes sense, given the dataframe would have ~3.2Bn rows (80,000 choose 2):
File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\core\frame.py:738, in DataFrame.__init__(self, data, index, columns, dtype, copy)
736 data = np.asarray(data)
737 else:
--> 738 data = list(data)
739 if len(data) > 0:
740 if is_dataclass(data[0]):
MemoryError:
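A quick sanity check on the scale (math.comb gives the exact count):

import math

# unordered pairs from an n-word list: n choose 2
print(f"{math.comb(80_000, 2):,}")  # 3,199,960,000 -> ~3.2Bn rows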
Any suggestions on how to tackle this memory issue, or is there a better way to implement my problem statement? I tried exploring multiprocessing, nested loops, etc., but without major success.
I'm using an Intel Windows laptop:
Processor: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor
I might try starting with this code, based on just using itertools without pandas. Since itertools.combinations is a lazy generator and acceptable matches are written straight to a CSV as they are found, memory use stays flat no matter how many pairs there are.
import csv
import itertools

import fuzzywuzzy.fuzz

MIN_RATIO = 90

## ----------------------
## the result of cleaning and filtering your input data...
## ----------------------
my_unique_list = ['apple', 'applee', 'aplee']
## ----------------------

## ----------------------
## Create a result file of acceptably close matches
## ----------------------
with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(["name1", "name2", "partial_ratio", "ratio"])

    for index, (word1, word2) in enumerate(itertools.combinations(my_unique_list, 2)):
        if index % 1000 == 0:
            print(f"combinations processed: {index}", end="\r", flush=True)

        partial_ratio = fuzzywuzzy.fuzz.partial_ratio(word1, word2)
        ratio = fuzzywuzzy.fuzz.ratio(word1, word2)
        if max(partial_ratio, ratio) >= MIN_RATIO:
            writer.writerow([word1, word2, partial_ratio, ratio])

print()
print(f"Total combinations processed: {index + 1}")
## ----------------------
While I'm not a multiprocessing expert, this might work. You might want to test it a bit on a smaller subset:
import csv
import functools
import itertools
import multiprocessing

import fuzzywuzzy.fuzz

MIN_RATIO = 90

def get_ratios(pair, queue):
    partial_ratio = fuzzywuzzy.fuzz.partial_ratio(*pair)
    ratio = fuzzywuzzy.fuzz.ratio(*pair)
    if max(partial_ratio, ratio) >= MIN_RATIO:
        queue.put(list(pair) + [partial_ratio, ratio])

def main(my_unique_list):
    with multiprocessing.Manager() as manager:
        queue = manager.Queue()

        with multiprocessing.Pool(processes=8) as pool:
            _ = pool.map(functools.partial(get_ratios, queue=queue), itertools.combinations(my_unique_list, 2), chunksize=1000)
            pool.close()
            pool.join()

        with open("good_matches.csv", "w", encoding="utf-8", newline="") as file_out:
            writer = csv.writer(file_out)
            writer.writerow(["name1", "name2", "partial_ratio", "ratio"])
            while not queue.empty():
                item = queue.get()
                writer.writerow(item)
                #print(item)

if __name__ == "__main__":
    ## ----------------------
    ## the result of cleaning and filtering your input data...
    ## ----------------------
    my_unique_list = ['apple', 'applee', 'aplee']
    ## ----------------------
    main(my_unique_list)
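One caveat I'd flag (again, not a multiprocessing expert): pool.map materializes the entire input iterable as a list and buffers a result entry for every pair before returning, which with billions of combinations can run out of memory on its own. Swapping in imap_unordered should help, since it pulls combinations lazily in chunks and yields results as workers finish. An untested sketch of just the pool section, reusing get_ratios, queue, and my_unique_list from above:

# inside main(), replacing the pool.map(...) call:
with multiprocessing.Pool(processes=8) as pool:
    # iterating drives the pool; the results are all None -- real output goes to the queue
    for _ in pool.imap_unordered(
            functools.partial(get_ratios, queue=queue),
            itertools.combinations(my_unique_list, 2),
            chunksize=1000):
        pass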