Here is my dataframe:
import pandas as pd
from fuzzywuzzy import fuzz, process

df = pd.DataFrame(
    dict(Name=['Emma Howard', 'Emma Ward', 'Emma Warner', 'Emma Wayden'],
         Age=[33, 34, 43, 44], Score=[90, 95, 93, 92])
)
list2 = df['Name'].tolist()
I am applying fuzzywuzzy's process.extractBests to each name i in the column:
process.extractBests(i, list2, score_cutoff=80, scorer=fuzz.ratio)
to extract the best matches on the Name column, but the result repeats pairs that were already matched in an earlier row.
The logic I want is this: "Emma Howard" and "Emma Ward" are already matched in the first row, so I do not want "Emma Howard" to appear among the matches in the second row, and likewise for the third and fourth rows.
Here is the complete pseudo code:
mat1 = []
list1 = df['Name'].tolist()
list2 = df['Name'].tolist()
list3 = df['Name'].tolist()
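for i in list1:
    # compare each name against every other name (exclude only itself)
    list2 = [x for x in list2 if x != i]
    mat1.append(process.extractBests(i, list2, score_cutoff=80, scorer=fuzz.ratio))
    # restore the full list for the next name
    list2 = list3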
df['matches'] = mat1
df.to_csv("xyz.csv")
IIUC, once a name has been used, it is no longer available for subsequent rows, so you can use set operations to remove already assigned names:
uniques = set(df['Name'])
matches = {}
for idx, row in df.iterrows():
    uniques -= set([row.Name])  # remove current name
    res = process.extractBests(row.Name, uniques, score_cutoff=80)
    uniques -= set([name for name, score in res])  # remove best results
    matches[idx] = res
df['matches'] = pd.Series(matches)
Note: the comparison gets faster as the loop progresses, because fewer names are left in the set at each iteration.
Output:
>>> df
          Name  Age  Score                                 matches
0  Emma Howard   33     90                       [(Emma Ward, 90)]
1    Emma Ward   34     95  [(Emma Wayden, 80), (Emma Warner, 80)]
2  Emma Warner   43     93                                      []
3  Emma Wayden   44     92                                      []
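If you want to trace the elimination logic without installing fuzzywuzzy, here is a minimal standard-library sketch of the same idea. It assumes difflib.SequenceMatcher on lowercased strings as a stand-in scorer, so the scores may differ from what fuzzywuzzy reports. The helper names simple_ratio and match_without_reuse are hypothetical, not part of any library. It keeps the remaining names in a list rather than a set, so tied scores come back in a predictable order.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity score from 0 to 100 (assumed stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_without_reuse(names, score_cutoff=80):
    """For each name in order, return its matches among names not yet used."""
    pool = list(names)  # names still available for matching
    results = []
    for name in names:
        # the current name can never match itself
        pool = [other for other in pool if other != name]
        scored = [(other, simple_ratio(name, other)) for other in pool]
        best = [(other, score) for other, score in scored if score >= score_cutoff]
        best.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
        # names matched here are no longer offered to later rows
        matched = {other for other, _ in best}
        pool = [other for other in pool if other not in matched]
        results.append(best)
    return results


names = ['Emma Howard', 'Emma Ward', 'Emma Warner', 'Emma Wayden']
for name, best in zip(names, match_without_reuse(names)):
    print(name, best)
```

The returned list has one entry per row, so it could be assigned back with df['matches'] = match_without_reuse(df['Name'].tolist()).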