python-3.x pandas fuzzywuzzy

pandas fuzzy match on the same column but prevent matching against itself


Fuzzy matching within a single column is a common question, but I have an extra condition: how do I exclude matches based on a unique ID? In other words, how do I prevent a row from matching against itself?

Given a dataframe:

import pandas as pd

df = pd.DataFrame({'id':[1, 2, 3],
                   'name':['pizza','pizza toast', 'ramen']})


I used solutions like the one in "Fuzzy match strings in one column and create new dataframe using fuzzywuzzy" to create a multi-index dataframe:

from fuzzywuzzy import fuzz

df_copy = df.copy()

compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)


That works, but how can I use the unique ID to prevent a row from matching against itself?

If two different IDs share the same name, say 1/pizza and 10/pizza, I want to keep that pair. I only need to drop the pairs where the same ID appears in both index levels.


Solution

  • I suggest a slightly different approach that gives the same result, using the difflib module from Python's standard library, which provides helpers for comparing sequences and computing deltas.

    Take the following dataframe, in which pizza appears under two different ids (so those two rows should still be checked against one another later on):

    import pandas as pd
    
    df = pd.DataFrame(
        {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
    )
    

    Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:

    from difflib import SequenceMatcher
    
    # Define a simple helper function
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()
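
As a quick check of what the helper returns, here is a minimal sketch applied to the sample names. Per the difflib documentation, `ratio()` is computed as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The variable names below are only for illustration.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Same helper as above: similarity score between 0 and 1
    return SequenceMatcher(None, a, b).ratio()


# "pizza" (5 chars) is fully contained in "pizza toast" (11 chars):
# M = 5 matched characters, T = 5 + 11 = 16, so the ratio is 2 * 5 / 16
pizza_vs_toast = ratio("pizza", "pizza toast")
print(pizza_vs_toast)  # 0.625

# Identical strings always score 1.0, which is why a row compared
# against itself has to be excluded explicitly
pizza_vs_pizza = ratio("pizza", "pizza")
print(pizza_vs_pizza)  # 1.0
```

The score is on a 0 to 1 scale, so it has to be multiplied by 100 to be comparable with the 0 to 100 scores that fuzzywuzzy reports.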
    

    Then apply the following steps:

    # Create a column of unique identifiers: (id, name)
    df["id_and_name"] = list(zip(df["id"], df["name"]))
    
    # Calculate the ratio against every other row, skipping the row's own id
    # (ids are assumed to be unique)
    df = df.assign(
        match=df["id_and_name"].map(
            lambda x: {
                value: ratio(x[1], value[1])
                for value in df["id_and_name"]
                if x[0] != value[0]
            }
        )
    )
    
    # Format results in a readable fashion
    df = (
        pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
        .reset_index(drop=False)
        .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
        .dropna()
        .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
        .reset_index(drop=True)
        .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
        .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
    )
    

    You get the expected result:

    print(df)
    # Output
             id_and_name other_id_and_name  ratio
    0         (1, pizza)        (4, pizza)    100
    1         (1, pizza)  (2, pizza toast)     62
    2         (1, pizza)        (3, ramen)     20
    3   (2, pizza toast)        (4, pizza)     62
    4   (2, pizza toast)        (1, pizza)     62
    5   (2, pizza toast)        (3, ramen)     12
    6         (3, ramen)        (4, pizza)     20
    7         (3, ramen)        (1, pizza)     20
    8         (3, ramen)  (2, pizza toast)     12
    9         (4, pizza)        (1, pizza)    100
    10        (4, pizza)  (2, pizza toast)     62
    11        (4, pizza)        (3, ramen)     20
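
The same self-exclusion rule can also be sketched without a DataFrame. The following is a plain-Python illustration, not part of the pandas pipeline above: it assumes ids are unique, uses `itertools.permutations` to enumerate every ordered pair of two different rows, and the names `rows` and `pairs` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Sample data from above as (id, name) tuples
rows = [(1, "pizza"), (2, "pizza toast"), (3, "ramen"), (4, "pizza")]


def ratio(a, b):
    # Similarity score between 0 and 1
    return SequenceMatcher(None, a, b).ratio()


# permutations(rows, 2) yields every ordered pair of two different rows,
# so a row is never paired with itself, while rows that share a name but
# have different ids (1 and 4 here) are still compared.
pairs = [
    (left, right, int(ratio(left[1], right[1]) * 100))
    for left, right in permutations(rows, 2)
]

for left, right, score in pairs:
    print(left, right, score)
```

With four rows this produces 12 pairs, the same number of rows as the table above. If a DataFrame is needed afterwards, `pairs` could be passed to `pd.DataFrame` with column names of your choice.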