python pandas dataframe fuzzywuzzy

How to avoid cyclic matches in fuzzywuzzy


Here is my dataframe:

import pandas as pd

df = pd.DataFrame(
    dict(Name=['Emma Howard', 'Emma Ward', 'Emma Warner', 'Emma Wayden'],
         Age=[33, 34, 43, 44], Score=[90, 95, 93, 92])
)

list2 = df['Name'].tolist()

I am applying fuzzywuzzy process:

process.extractBests(i, list2, score_cutoff=80, scorer=fuzz.ratio)

to extract the best matches on the Name column. The result it gives was posted as a screenshot (image not available).

The output I'm expecting was also posted as a screenshot (image not available).

The logic is that "Emma Howard" and "Emma Ward" are already matched in the first row, so I do not want "Emma Howard" to appear among the matches in the second row, and likewise for the third and fourth rows.

Here is the complete code:

from fuzzywuzzy import fuzz, process

names = df['Name'].tolist()
mat1 = []

for i in names:
    # Compare each name against every other name, excluding itself
    others = [x for x in names if x != i]
    mat1.append(process.extractBests(i, others, score_cutoff=80, scorer=fuzz.ratio))
df['matches'] = mat1
df.to_csv("xyz.csv")

Solution

  • IIUC, once a name has been used (either as the query or as a match), it is no longer available for subsequent rows, so you can use set operations to remove already assigned names:

    from fuzzywuzzy import process

    uniques = set(df['Name'])  # names still available for matching
    matches = {}
    for idx, row in df.iterrows():
        uniques.discard(row['Name'])  # remove current name
        # No scorer is passed, so the default scorer (WRatio) is used
        res = process.extractBests(row['Name'], uniques, score_cutoff=80)
        uniques -= {name for name, score in res}  # remove matched names
        matches[idx] = res
    df['matches'] = pd.Series(matches)
    

    Note: each iteration is faster than the previous one because there are fewer names left in the set to compare against.

    Output:

    >>> df
              Name  Age  Score                                 matches
    0  Emma Howard   33     90                       [(Emma Ward, 90)]
    1    Emma Ward   34     95  [(Emma Wayden, 80), (Emma Warner, 80)]
    2  Emma Warner   43     93                                      []
    3  Emma Wayden   44     92                                      []
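
If it helps to see the "remove names once they are used" logic on its own, here is a minimal sketch that does not depend on fuzzywuzzy or pandas. The names `simple_ratio` and `match_without_repeats` are made up for this illustration, and `simple_ratio` is only a stand-in scorer built on the standard library's `difflib`, so its scores will not necessarily agree with fuzzywuzzy's. Unlike the set-based version above, it keeps the pool as a list so the order of equal-scoring matches is predictable.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in 0-100 scorer built on difflib (not fuzzywuzzy's own scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_without_repeats(names, scorer=simple_ratio, cutoff=80):
    """Return one list of (match, score) pairs per input name.

    A name leaves the pool as soon as it has been processed or matched,
    so it cannot appear again in the matches of a later row.
    """
    remaining = list(dict.fromkeys(names))  # ordered pool without duplicates
    results = []
    for name in names:
        if name in remaining:
            remaining.remove(name)  # never compare a name with itself
        hits = [(other, scorer(name, other)) for other in remaining]
        hits = [h for h in hits if h[1] >= cutoff]
        hits.sort(key=lambda h: h[1], reverse=True)  # best score first
        matched = {other for other, _ in hits}
        remaining = [n for n in remaining if n not in matched]
        results.append(hits)
    return results
```

Assuming fuzzywuzzy is installed, something like `df['matches'] = match_without_repeats(df['Name'].tolist(), scorer=fuzz.ratio)` should plug in the scorer from the question. Keep in mind that `process.extractBests` also preprocesses the strings by default before scoring, so the scores may not be identical to those in the output above.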