This is a common question, but I have an extra condition: how do I remove matches based on a unique ID? In other words, how do I prevent a row from being matched against itself?
Given a dataframe:
import pandas as pd
df = pd.DataFrame({'id':[1, 2, 3],
                   'name':['pizza','pizza toast', 'ramen']})
I used solutions like the one in "Fuzzy match strings in one column and create new dataframe using fuzzywuzzy" to create a multi-index dataframe:
from fuzzywuzzy import fuzz

df_copy = df.copy()
compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
That works, but how can I use the unique ID to keep a row from being matched against itself?
If two rows share a name but have different IDs, such as ID/name = 1/pizza and 10/pizza, I obviously want to keep that pair. I only need to remove pairs where the same ID appears in both index levels.
I suggest a slightly different approach that gives the same result using the difflib module from the Python standard library, which provides helpers for computing deltas.
So, with the following dataframe, in which pizza has two different ids (so those two rows should be checked against each other later on):
import pandas as pd
df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
)
Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:
from difflib import SequenceMatcher
# Define a simple helper function: similarity of two strings as a float in [0, 1]
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()
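As a quick sanity check of the helper, here is a minimal sketch that calls it on the names used in this example. The values follow from how SequenceMatcher defines its ratio (twice the number of matching characters divided by the combined length of both strings):

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Same helper as above, repeated so this snippet runs on its own
    return SequenceMatcher(None, a, b).ratio()


print(ratio("pizza", "pizza"))        # 1.0   (identical strings)
print(ratio("pizza", "pizza toast"))  # 0.625 (5 matching chars, 2 * 5 / 16)
print(ratio("pizza", "ramen"))        # 0.2   (only "a" matches, 2 * 1 / 10)
```

Note that this scale is 0 to 1, whereas fuzz.ratio in the question reports 0 to 100, which is why the results are multiplied by 100 further down.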
Then apply the following steps:
# Create a column of unique identifiers: (id, name)
df["id_and_name"] = list(zip(df["id"], df["name"]))
# Calculate ratio only for different id_and_names
df = df.assign(
    match=df["id_and_name"].map(
        lambda x: {
            value: ratio(x[1], value[1])
            for value in df["id_and_name"]
            if x[0] != value[0] or ratio(x[1], value[1]) != 1
        }
    )
)
# Format results in a readable fashion
df = (
    pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
    .reset_index(drop=False)
    .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
    .dropna()
    .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
    .reset_index(drop=True)
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
)
You get the expected result:
print(df)
# Output
         id_and_name  other_id_and_name  ratio
0         (1, pizza)         (4, pizza)    100
1         (1, pizza)   (2, pizza toast)     62
2         (1, pizza)         (3, ramen)     20
3   (2, pizza toast)         (4, pizza)     62
4   (2, pizza toast)         (1, pizza)     62
5   (2, pizza toast)         (3, ramen)     12
6         (3, ramen)         (4, pizza)     20
7         (3, ramen)         (1, pizza)     20
8         (3, ramen)   (2, pizza toast)     12
9         (4, pizza)         (1, pizza)    100
10        (4, pizza)   (2, pizza toast)     62
11        (4, pizza)         (3, ramen)     20
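If the ids really are unique, the same pairs can also be produced without pandas. The following is only a sketch under that assumption, not the method used above: itertools.permutations yields every ordered pair of distinct rows, so a row is never compared with itself by construction. The names records and pairwise_ratios are hypothetical, and the records list simply mirrors the example dataframe.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical records mirroring the (id, name) rows of the example dataframe
records = [(1, "pizza"), (2, "pizza toast"), (3, "ramen"), (4, "pizza")]


def pairwise_ratios(rows):
    """Score every ordered pair of distinct rows as an integer percentage."""
    return [
        (left, right, int(SequenceMatcher(None, left[1], right[1]).ratio() * 100))
        # permutations pairs rows by position, so no row meets itself;
        # this equals "different id" only when ids are unique (assumption)
        for left, right in permutations(rows, 2)
    ]


pairs = pairwise_ratios(records)
for left, right, score in pairs:
    print(left, right, score)
```

The resulting list of tuples can be passed to pd.DataFrame with columns such as id_and_name, other_id_and_name and ratio if a dataframe is needed afterwards.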