I am trying to find similar emails in a list. For this,
database.head()
TID PID Names
0 22330 134575 tim
1 22333 134578 tim.rand
2 22328 134571 rand.001
3 22340 134568 pankit090
4 22325 134569 timrook
emailsdb = database['Names'].values.tolist()
Now the iteration part
list = []
for email in emailsdb :
result = process.extractBests(email, emailsdb, score_cutoff=85, limit=100)
list.append(result)
The list output is
[[('tim', 100), ('tim.rand', 90), ('timrook', 90)],
[('tim.rand', 100), ('tim', 90)],
[('rand.001', 100)],
[('pankit090', 100),
('pankit001', 89),
('pankit002', 89),
('pankit003', 89),
('pankit004', 89),
('pankit005', 89)],
[('timrook', 100), ('tim', 90)],
[('pankit001', 100),
('pankit090', 89),
('pankit002', 89),
('pankit003', 89),
('pankit004', 89),
('pankit005', 89)],
[('pankit002', 100),
('pankit090', 89),
('pankit001', 89),
('pankit003', 89),
('pankit004', 89),
('pankit005', 89)],
but I want to avoid ('tim', 100), ('tim.rand', 100), ('rand.001', 100), ('pankit090', 100), ('timrook', 100), ('pankit001', 100),('pankit002', 100) because these are obviously perfect match
Posting this as another answer since this solves it in a completely different way, in light of the discussion we had on chat.
Approach -
If you use a commutative function to get string matches, then you can actually compute only the lower triangular matrix of the distance matrix since f(x,y) will be the same as f(y,x).
Fuzzy ratio is NOT commutative while Levenshtein distance (aka edit distance, which is the foundation of fuzzy matching) is commutative. Here instead getting maximum score, you are finding strings with the minimum lev distance between them
So first you take the indexes for lower triangular matrix and iterate over them to calculate the lev distance of the emails with those indexes. You DONT iterate on the diagonal of the distance matrix f(x,x) because its trivial, and later you set it as a high value like 999 so that you can find the min score values in each row.
This gives you the distance matrix (see output). Next for each row (email) you find the nearest string (minimum distance) and create it as a tuple of Best Matches (see output)
Now, the issue here is that even if NONE of the string match that well, it will still greedily get the nearest one.
So finally you can take the fuzz.ratio between these strings and get fuzzy scores, which you can filter on. So now, you can avoid garbage matches and only get the ones which truely match above a certain threshold. This gives Final Matching strings (see output)
This should be probably the best you can do with optimizing the code from an algorithmic perspective.
import pandas as pd
import numpy as np
#!pip install python-Levenshtein
from Levenshtein import distance
from fuzzywuzzy import fuzz, process
#CHECKING COMMUTATIVE PROPERTY
distance('abcd','bdca') == distance('bdca', 'abcd')
###OUTPUT - TRUE
fuzz.ratio('abcd','bdca') == fuzz.ratio('bdca', 'abcd')
###OUTPUT - FALSE
#Create dummy matrix
m = np.zeros((len(emailsdb),len(emailsdb)))
#Get lower triangular matrix indices
i,j = np.tril_indices(len(emailsdb))
#iterate over lower triangular and get Levenshtein distance for lower triangular
for k in zip(i,j):
if k[0]!=k[1]:
m[k] = distance(emailsdb[k[0]], emailsdb[k[1]])
else:
m[k] = 0
#Copy lower triangular to upper traingular thanks to COMMUTATIVE PROPERTY
m = m + m.T
#Filling diagonal with 999 so that same string doesnt show up with minimum distance
np.fill_diagonal(m, 999)
print("distance matrix = ")
print(m)
#Get best matches with minimum distance
matches = list(zip([i for i in emailsdb], [emailsdb[i] for i in np.argmin(m, axis=0)]))
print("Best matches = ")
print(matches)
#OPTIONAL -> Filtering out irrelavant matches based on fuzzy match
f = [(*i,fuzz.ratio(*i)) for i in matches if fuzz.ratio(*i)>50]
print("final matching strings - ")
print(f)
distance matrix =
[[999. 5. 8. 8. 4.]
[ 5. 999. 8. 9. 4.]
[ 8. 8. 999. 6. 8.]
[ 8. 9. 6. 999. 9.]
[ 4. 4. 8. 9. 999.]]
Best matches =
[('tim', 'timrook'), ('tim.rand', 'timrook'), ('rand.001', 'pankit090'), ('pankit090', 'rand.001'), ('timrook', 'tim')]
final matching strings -
[('tim', 'timrook', 60), ('tim.rand', 'timrook', 53), ('timrook', 'tim', 60)]