I'm trying to implement a name duplications for one of our use case.
Here I have a set of 10 names along with their index column as below.
df = pl.from_repr("""
┌───────┬───────────────────┐
│ index ┆ full_name │
│ --- ┆ --- │
│ u32 ┆ str │
╞═══════╪═══════════════════╡
│ 0 ┆ Mallesham Yamulla │
│ 1 ┆ Velmala Sharath │
│ 2 ┆ Jagarini Yegurla │
│ 3 ┆ Sharath Velmala │
│ 4 ┆ Bhavik Vemulla │
│ 5 ┆ Yegurla Mahesh │
│ 6 ┆ Yegurla Jagarini │
│ 7 ┆ Vermula Bhavik │
│ 8 ┆ Mahesh Yegurla │
│ 9 ┆ Yamulla Mallesham │
└───────┴───────────────────┘
""")
Here I would like to calculate fuzzy metrics(Levenshtein,JaroWinkler) per each of name combinations using a rapidxfuzz module as below.
from rapidfuzz import fuzz
from rapidfuzz.distance import Levenshtein,JaroWinkler
round(Levenshtein.normalized_similarity(name_0,name_1),5)
round(JaroWinkler.similarity(name_0,name_1),5)
For example: idx-0 name Mallesham Yamulla to be paired with names having indexes sequence (1,9) names[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9)] and calculate their levenshtein and Jarowrinkler similar percentages.
Next idx-1 name with names index sequence (2,9), idx-2 with name index sequence (3,9), idx-3 with (4,9) so on so forth till (8,9)
The expected output would be :
Create example dataframe.
df = pl.DataFrame(
pl.Series("full_name", ["Aaaa aaaa", "Baaa abba", "Acac acca", "Dada dddd"])
).with_row_index()
shape: (4, 2)
┌───────┬───────────┐
│ index ┆ full_name │
│ --- ┆ --- │
│ u32 ┆ str │
╞═══════╪═══════════╡
│ 0 ┆ Aaaa aaaa │
│ 1 ┆ Baaa abba │
│ 2 ┆ Acac acca │
│ 3 ┆ Dada dddd │
└───────┴───────────┘
Join dataframe with itself in a cross join and remove rows where index == index
df_combinations = df.join(
df,
how="cross",
suffix="_2",
).filter(
pl.col("index") != pl.col("index_2")
)
shape: (12, 4)
┌───────┬───────────┬─────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str │
╞═══════╪═══════════╪═════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 1 ┆ Baaa abba │
│ 0 ┆ Aaaa aaaa ┆ 2 ┆ Acac acca │
│ 0 ┆ Aaaa aaaa ┆ 3 ┆ Dada dddd │
│ 1 ┆ Baaa abba ┆ 0 ┆ Aaaa aaaa │
│ 1 ┆ Baaa abba ┆ 2 ┆ Acac acca │
│ … ┆ … ┆ … ┆ … │
│ 2 ┆ Acac acca ┆ 1 ┆ Baaa abba │
│ 2 ┆ Acac acca ┆ 3 ┆ Dada dddd │
│ 3 ┆ Dada dddd ┆ 0 ┆ Aaaa aaaa │
│ 3 ┆ Dada dddd ┆ 1 ┆ Baaa abba │
│ 3 ┆ Dada dddd ┆ 2 ┆ Acac acca │
└───────┴───────────┴─────────┴─────────────┘
Run rapidfuzz using map_elements
df_combinations.with_columns(
# Combine "index" and "index_2" columns to one struct column.
pl.struct("index", "index_2").alias("index_comb"),
# Combine "full_name" and "full_name_2" columns to one struct column.
pl.struct("full_name", "full_name_2").alias("full_name_comb"),
).with_columns(
# Run custom functions on struct column.
pl.col("full_name_comb").map_elements(lambda t: Levenshtein.normalized_similarity(t["full_name"], t["full_name_2"])).alias("levenshtein"),
pl.col("full_name_comb").map_elements(lambda t: JaroWinkler.similarity(t["full_name"], t["full_name_2"])).alias("jarowinkler"),
)
shape: (12, 8)
┌───────┬───────────┬─────────┬─────────────┬────────────┬───────────────────────────┬─────────────┬─────────────┐
│ index ┆ full_name ┆ index_2 ┆ full_name_2 ┆ index_comb ┆ full_name_comb ┆ levenshtein ┆ jarowinkler │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ u32 ┆ str ┆ struct[2] ┆ struct[2] ┆ f64 ┆ f64 │
╞═══════╪═══════════╪═════════╪═════════════╪════════════╪═══════════════════════════╪═════════════╪═════════════╡
│ 0 ┆ Aaaa aaaa ┆ 1 ┆ Baaa abba ┆ {0,1} ┆ {"Aaaa aaaa","Baaa abba"} ┆ 0.666667 ┆ 0.777778 │
│ 0 ┆ Aaaa aaaa ┆ 2 ┆ Acac acca ┆ {0,2} ┆ {"Aaaa aaaa","Acac acca"} ┆ 0.555556 ┆ 0.637037 │
│ 0 ┆ Aaaa aaaa ┆ 3 ┆ Dada dddd ┆ {0,3} ┆ {"Aaaa aaaa","Dada dddd"} ┆ 0.333333 ┆ 0.555556 │
│ 1 ┆ Baaa abba ┆ 0 ┆ Aaaa aaaa ┆ {1,0} ┆ {"Baaa abba","Aaaa aaaa"} ┆ 0.666667 ┆ 0.777778 │
│ 1 ┆ Baaa abba ┆ 2 ┆ Acac acca ┆ {1,2} ┆ {"Baaa abba","Acac acca"} ┆ 0.444444 ┆ 0.546296 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 2 ┆ Acac acca ┆ 1 ┆ Baaa abba ┆ {2,1} ┆ {"Acac acca","Baaa abba"} ┆ 0.444444 ┆ 0.546296 │
│ 2 ┆ Acac acca ┆ 3 ┆ Dada dddd ┆ {2,3} ┆ {"Acac acca","Dada dddd"} ┆ 0.111111 ┆ 0.444444 │
│ 3 ┆ Dada dddd ┆ 0 ┆ Aaaa aaaa ┆ {3,0} ┆ {"Dada dddd","Aaaa aaaa"} ┆ 0.333333 ┆ 0.555556 │
│ 3 ┆ Dada dddd ┆ 1 ┆ Baaa abba ┆ {3,1} ┆ {"Dada dddd","Baaa abba"} ┆ 0.333333 ┆ 0.555556 │
│ 3 ┆ Dada dddd ┆ 2 ┆ Acac acca ┆ {3,2} ┆ {"Dada dddd","Acac acca"} ┆ 0.111111 ┆ 0.444444 │
└───────┴───────────┴─────────┴─────────────┴────────────┴───────────────────────────┴─────────────┴─────────────┘