I'm trying to reduce a list of names, and in order to perform this I'm using the fuzzywuzzy
library.
I perform two for loops, both over all the names. If the two names have a fuzzy match score between the 90 and the 100, Then I rewrite the second name with the first name.
Here is an example of my dataset, data
.
nombre
0 VICTOR MORENO MORENO
1 SERGIO HERNANDEZ GUTIERREZ
2 FRANCISCO JAVIER MUÑOZ LOPEZ
3 JUAN RAYMUNDO MORALES MARTINEZ
4 IVAN ERNESTO SANCHEZ URROZ
And here is my function:
def fuzz_analisis0(top_names):
for name2 in top_names["nombre"]:
for name in top_names["nombre"]:
if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100:
top_names[top_names["nombre"]==name] = name2
When I run this with:
fuzz_analisis0(data)
Everything works fine. Here is an output that shows how it works.
print(len(data))
# 1400
data = data.drop_duplicates()
print(len(data))
# 1256
But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:
cores = mp.cpu_count()
df_split = np.array_split(data, cores, axis=0)
pool = Pool(cores)
df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
pool.close()
pool.join()
pool.clear()
The function ends faster than expected and does not find any duplicates.
print(len(data))
# 1400
data = data.drop_duplicates()
print(len(data))
# 1400
If any can help me to figure out what is happening here and how to solve it, I'll be so grateful. Thanks in advance.
edit:
now i have this another function that works with the result of the last one
def fuzz_analisis(dataframe, top_names):
for index in top_names['nombre'].index:
name2 = top_names.loc[index,'nombre']
for index2 in dataframe["nombre"].index:
name = dataframe.loc[index2,'nombre']
if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100:
dataframe.loc[index,'nombre'] = name
the dataframe looks loke this:
nombre foto1 foto2 sexo fecha_hora_registro
folio
131 JUAN DOMINGO GONZALEZ DELGADO 131.jpg 131.jpg MASCULINO 2008-08-07 15:42:25
132 FRANCISCO JAVIER VELA RAMIREZ 132.jpg 132.jpg MASCULINO 2008-08-07 15:50:42
133 JUAN CARLOS PEREZ MEDINA 133.jpg 133.jpg MASCULINO 2008-08-07 16:37:24
134 ARMANDO SALINAS SALINAS 134.jpg 134.jpg MASCULINO 2008-08-07 17:18:12
135 JOSE FELIX ZAMBRANO AMEZQUITA 135.jpg 135.jpg MASCULINO 2008-08-07 17:55:05
When you recognise your algorithm to be slow multiprocessing is a way to speed it up. However you should probably try to speed up the algorithm first. When using fuzzywuzzy fuzz.ratio
will calculate a normalized levenshtein distance, which is a O(N*M) operation. Therefore you should try to minimise the usage. So here is a optimised solution of mcskinner's
multiprocessed solution:
from functools import partial
from fuzzywuzzy import fuzz
import multiprocessing as mp
import numpy as np
def length_ratio(s1, s2):
s1_len = len(s1)
s2_len = len(s2)
distance = s1_len - s2_len if s1_len > s2_len else s2_len - s1_len
lensum = s1_len + s2_len
return 100 - 100 * distance / lensum
def fuzz_analisis0_partial(top_names, partial_top_names):
for name2 in top_names["nombre"]:
for name in partial_top_names["nombre"]:
if length_ratio(name, name2) < 90:
continue
ratio = fuzz.ratio(name, name2)
if ratio>90 and ratio<100:
partial_top_names[partial_top_names["nombre"] == name] = name2
return partial_top_names
cores = mp.cpu_count()
df_split = np.array_split(data, cores, axis=0)
pool = mp.Pool(cores)
processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
processed = np.vstack(list(processed_parts))
pool.close()
pool.join()
First of this solution only executes fuzz.ratio once instead of twice and since this is what takes most of the time, this should give you about a 50% runtime improvement. Then as a second improvement it checks a length based ratio beforehand. This length based ratio is always at least as big as fuzz.ratio
, but can be calculated in constant time. Therefore all names that have a big length difference can be processed a lot faster. Beside this make sure your using fuzzywuzzy with python-Levenshtein since this is a lot faster than the version using difflib. As an even faster alternative you could use RapidFuzz(I am the author of RapidFuzz). RapidFuzz already calculates the length ratio when you pass it a cutoff score fuzz.ratio(name, name2, score_cutoff=90)
, so the lenght_ratio function is not required when using it.
Using RapidFuzz the equivalent function fuzz_analisis0_partial
can be programmed the following way:
from rapidfuzz import fuzz
def fuzz_analisis0_partial(top_names, partial_top_names):
for name2 in top_names["nombre"]:
for name in partial_top_names["nombre"]:
ratio = fuzz.ratio(name, name2, score_cutoff=90)
if ratio > 90 and ratio < 100:
partial_top_names[partial_top_names["nombre"] == name] = name2
return partial_top_names