pythonfor-loopiterationfuzzywuzzy

Skip iterating on itself (List)


I am trying to find similar emails in a list. For this,

database.head()

    TID PID Names
0   22330   134575  tim
1   22333   134578  tim.rand
2   22328   134571  rand.001
3   22340   134568  pankit090
4   22325   134569  timrook

emailsdb = database['Names'].values.tolist() Now the iteration part

list = []
for email in emailsdb :
    result = process.extractBests(email, emailsdb, score_cutoff=85, limit=100)
    list.append(result)

The list output is

[[('tim', 100), ('tim.rand', 90), ('timrook', 90)],
 [('tim.rand', 100), ('tim', 90)],
 [('rand.001', 100)],
 [('pankit090', 100),
  ('pankit001', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 [('timrook', 100), ('tim', 90)],
 [('pankit001', 100),
  ('pankit090', 89),
  ('pankit002', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],
 [('pankit002', 100),
  ('pankit090', 89),
  ('pankit001', 89),
  ('pankit003', 89),
  ('pankit004', 89),
  ('pankit005', 89)],

but I want to avoid ('tim', 100), ('tim.rand', 100), ('rand.001', 100), ('pankit090', 100), ('timrook', 100), ('pankit001', 100),('pankit002', 100) because these are obviously perfect match


Solution

  • Posting this as another answer since this solves it in a completely different way, in light of the discussion we had on chat.

    Approach -

    1. If you use a commutative function to get string matches, then you can actually compute only the lower triangular matrix of the distance matrix since f(x,y) will be the same as f(y,x).

    2. Fuzzy ratio is NOT commutative while Levenshtein distance (aka edit distance, which is the foundation of fuzzy matching) is commutative. Here instead getting maximum score, you are finding strings with the minimum lev distance between them

    3. So first you take the indexes for lower triangular matrix and iterate over them to calculate the lev distance of the emails with those indexes. You DONT iterate on the diagonal of the distance matrix f(x,x) because its trivial, and later you set it as a high value like 999 so that you can find the min score values in each row.

    4. This gives you the distance matrix (see output). Next for each row (email) you find the nearest string (minimum distance) and create it as a tuple of Best Matches (see output)

    5. Now, the issue here is that even if NONE of the string match that well, it will still greedily get the nearest one.

    6. So finally you can take the fuzz.ratio between these strings and get fuzzy scores, which you can filter on. So now, you can avoid garbage matches and only get the ones which truely match above a certain threshold. This gives Final Matching strings (see output)

    This should be probably the best you can do with optimizing the code from an algorithmic perspective.

    import pandas as pd
    import numpy as np
    
    #!pip install python-Levenshtein
    from Levenshtein import distance
    from fuzzywuzzy import fuzz, process
    
    #CHECKING COMMUTATIVE PROPERTY
    distance('abcd','bdca') == distance('bdca', 'abcd')
    ###OUTPUT - TRUE
    
    fuzz.ratio('abcd','bdca') == fuzz.ratio('bdca', 'abcd')
    ###OUTPUT - FALSE
    
    
    #Create dummy matrix
    m = np.zeros((len(emailsdb),len(emailsdb)))
    
    #Get lower triangular matrix indices
    i,j = np.tril_indices(len(emailsdb))
    
    #iterate over lower triangular and get Levenshtein distance for lower triangular
    for k in zip(i,j):
        if k[0]!=k[1]:
            m[k] = distance(emailsdb[k[0]], emailsdb[k[1]])
        else:
            m[k] = 0
    
    #Copy lower triangular to upper traingular thanks to COMMUTATIVE PROPERTY
    m = m + m.T
    
    #Filling diagonal with 999 so that same string doesnt show up with minimum distance
    np.fill_diagonal(m, 999)
    print("distance matrix = ")
    print(m)
    
    #Get best matches with minimum distance
    matches = list(zip([i for i in emailsdb], [emailsdb[i] for i in np.argmin(m, axis=0)]))
    
    print("Best matches = ")
    print(matches)
    
    
    #OPTIONAL -> Filtering out irrelavant matches based on fuzzy match
    f = [(*i,fuzz.ratio(*i)) for i in matches if fuzz.ratio(*i)>50]
    
    print("final matching strings - ")
    print(f)
    
    distance matrix = 
    [[999.   5.   8.   8.   4.]
     [  5. 999.   8.   9.   4.]
     [  8.   8. 999.   6.   8.]
     [  8.   9.   6. 999.   9.]
     [  4.   4.   8.   9. 999.]]
    
    
    Best matches = 
    [('tim', 'timrook'), ('tim.rand', 'timrook'), ('rand.001', 'pankit090'), ('pankit090', 'rand.001'), ('timrook', 'tim')]
    
    
    final matching strings - 
    [('tim', 'timrook', 60), ('tim.rand', 'timrook', 53), ('timrook', 'tim', 60)]