
How to parallel process this nested loop in Python


I'm trying to deduplicate a list of names, and to do this I'm using the fuzzywuzzy library.

I run two nested for loops, both over all the names. If two names have a fuzzy match score between 90 and 100, then I overwrite the second name with the first.
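To illustrate the 90-100 band: fuzz.ratio returns a similarity score from 0 to 100, and without python-Levenshtein installed fuzzywuzzy falls back to difflib. A small stand-in using only the standard library (simple_ratio is a hypothetical helper mimicking fuzz.ratio's difflib fallback):

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Stand-in for fuzz.ratio: similarity scaled to 0-100
    # (fuzzywuzzy uses difflib like this when python-Levenshtein is absent).
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio("VICTOR MORENO MORENO", "VICTOR MORENO MORENO"))   # identical -> 100, excluded by ratio < 100
print(simple_ratio("VICTOR MORENO MORENO", "VICTOR MORENO MORENOS"))  # near match, falls in the 90-100 band
```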

Here is an example of my dataset, data.

                              nombre
0               VICTOR MORENO MORENO
1         SERGIO HERNANDEZ GUTIERREZ
2       FRANCISCO JAVIER MUÑOZ LOPEZ
3     JUAN RAYMUNDO MORALES MARTINEZ
4         IVAN ERNESTO SANCHEZ URROZ

And here is my function:

def fuzz_analisis0(top_names):
    for name2 in top_names["nombre"]:    
        for name in top_names["nombre"]: 
            if fuzz.ratio(name, name2)>90 and fuzz.ratio(name, name2)<100:
                top_names[top_names["nombre"]==name] = name2

When I run this with:

fuzz_analisis0(data)

Everything works fine. Here is an output that shows how it works.

print(len(data))
# 1400

data = data.drop_duplicates()
print(len(data))
# 1256

But now, if I try it with parallel processing, it no longer works as expected. Here is the parallelized code:

cores = mp.cpu_count()
df_split = np.array_split(data, cores, axis=0)
pool = Pool(cores)
df_out = np.vstack(pool.map(fuzz_analisis0, df_split))
pool.close()
pool.join()
pool.clear()

The function ends faster than expected and does not find any duplicates.

print(len(data))
# 1400

data = data.drop_duplicates()
print(len(data))
# 1400

If anyone can help me figure out what is happening here and how to solve it, I'll be very grateful. Thanks in advance.
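The likely culprit: fuzz_analisis0 mutates its argument in place and returns None. pool.map pickles each chunk into a worker process, so the mutations happen on a copy, and all the parent gets back is the return values. A minimal stdlib sketch (deepcopy stands in for the pickling that a real Pool does):

```python
from copy import deepcopy

def mutate_in_place(rows):
    # Mimics fuzz_analisis0: edits its argument, returns nothing.
    rows[0] = "CHANGED"

data = [["A"], ["B"]]  # two "chunks", as produced by np.array_split
# deepcopy mimics sending each chunk to a worker process
results = [mutate_in_place(deepcopy(chunk)) for chunk in data]

print(results)  # [None, None] -- pool.map would collect exactly this
print(data)     # [['A'], ['B']] -- the parent's data is untouched
```

This is why the parallel run finds no duplicates: the function must return the modified chunk, as the solution below does.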

Edit:

Now I have another function that works with the result of the previous one:

def fuzz_analisis(dataframe, top_names):
    for index in top_names['nombre'].index:
        name2 = top_names.loc[index, 'nombre']
        for index2 in dataframe["nombre"].index:
            name = dataframe.loc[index2, 'nombre']

            if fuzz.ratio(name, name2) > 90 and fuzz.ratio(name, name2) < 100:
                dataframe.loc[index, 'nombre'] = name

The dataframe looks like this:

    nombre  foto1   foto2   sexo    fecha_hora_registro
folio                   
131     JUAN DOMINGO GONZALEZ DELGADO   131.jpg     131.jpg     MASCULINO   2008-08-07 15:42:25
132     FRANCISCO JAVIER VELA RAMIREZ   132.jpg     132.jpg     MASCULINO   2008-08-07 15:50:42
133     JUAN CARLOS PEREZ MEDINA    133.jpg     133.jpg     MASCULINO   2008-08-07 16:37:24
134     ARMANDO SALINAS SALINAS     134.jpg     134.jpg     MASCULINO   2008-08-07 17:18:12
135     JOSE FELIX ZAMBRANO AMEZQUITA   135.jpg     135.jpg     MASCULINO   2008-08-07 17:55:05

Solution

  • When you find that your algorithm is slow, multiprocessing is one way to speed it up, but you should probably try to speed up the algorithm itself first. In fuzzywuzzy, fuzz.ratio computes a normalized Levenshtein distance, which is an O(N*M) operation, so you should minimise how often it is called. Here is an optimised version of mcskinner's multiprocessed solution:

    from functools import partial
    from fuzzywuzzy import fuzz
    import multiprocessing as mp
    import numpy as np
    
    def length_ratio(s1, s2):
        s1_len = len(s1)
        s2_len = len(s2)
        distance = abs(s1_len - s2_len)
        lensum = s1_len + s2_len
        return 100 - 100 * distance / lensum
    
    def fuzz_analisis0_partial(top_names, partial_top_names): 
        for name2 in top_names["nombre"]: 
            for name in partial_top_names["nombre"]:
                if length_ratio(name, name2) < 90:
                    continue
    
                ratio = fuzz.ratio(name, name2)
                if ratio > 90 and ratio < 100:
                    partial_top_names[partial_top_names["nombre"] == name] = name2 
        return partial_top_names
    
    cores = mp.cpu_count() 
    df_split = np.array_split(data, cores, axis=0) 
    
    pool = mp.Pool(cores)
    processed_parts = pool.map(partial(fuzz_analisis0_partial, data), df_split)
    processed = np.vstack(list(processed_parts))
    
    pool.close() 
    pool.join()
    

    First off, this solution only executes fuzz.ratio once per pair instead of twice, and since this is what takes most of the time, it should give you roughly a 50% runtime improvement. As a second improvement, it checks a length-based ratio beforehand. This length-based ratio is always at least as big as fuzz.ratio, but can be calculated in constant time, so all name pairs with a big length difference can be rejected much faster. Besides this, make sure you're using fuzzywuzzy with python-Levenshtein, since it is a lot faster than the fallback based on difflib. As an even faster alternative you could use RapidFuzz (I am the author of RapidFuzz). RapidFuzz already applies the length-based check when you pass it a cutoff score, fuzz.ratio(name, name2, score_cutoff=90), so the length_ratio function is not required when using it.
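The prefilter is safe because the Levenshtein distance between two strings is at least their length difference, so the length-based score is always an upper bound on fuzz.ratio: no true match is ever rejected. A quick stdlib check of that bound, with difflib standing in for fuzz.ratio (the bound also holds for difflib's ratio, since the number of matching characters never exceeds the shorter length):

```python
from difflib import SequenceMatcher

def length_ratio(s1, s2):
    # Constant-time upper bound on the similarity ratio,
    # equivalent to 100 * 2*min(len) / (len(s1) + len(s2)).
    distance = abs(len(s1) - len(s2))
    lensum = len(s1) + len(s2)
    return 100 - 100 * distance / lensum

def simple_ratio(s1, s2):
    # difflib stand-in for fuzz.ratio, scaled to 0-100.
    return 100 * SequenceMatcher(None, s1, s2).ratio()

pairs = [("JUAN CARLOS PEREZ MEDINA", "JUAN CARLOS PEREZ MEDINAS"),
         ("ARMANDO SALINAS SALINAS", "IVAN ERNESTO SANCHEZ URROZ")]
for a, b in pairs:
    # Small epsilon allows for float rounding when the bound is tight.
    assert length_ratio(a, b) + 1e-9 >= simple_ratio(a, b)
    print(f"{length_ratio(a, b):.1f} >= {simple_ratio(a, b):.1f}")
```

The second pair shows the payoff: the cheap bound already falls well below 90, so the expensive ratio call can be skipped entirely.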

    Using RapidFuzz, the equivalent function fuzz_analisis0_partial can be written as follows:

    from rapidfuzz import fuzz
    
    def fuzz_analisis0_partial(top_names, partial_top_names): 
        for name2 in top_names["nombre"]: 
            for name in partial_top_names["nombre"]:
                ratio = fuzz.ratio(name, name2, score_cutoff=90)
                if ratio > 90 and ratio < 100: 
                    partial_top_names[partial_top_names["nombre"] == name] = name2 
        return partial_top_names