I have two polars dataframes containing a unique ID and the name of a utility. I am trying to build a mapping of entries between these two dataframes. I am using polars_fuzzy_match
to do a fuzzy string search against entries. My first dataframe (wg_df
) is approximately a subset of the second (eia_df
). In my code below I am passing each utility_name
from wg_df
into fuzzy_match_score run against the eia_utility_name
. Can I avoid the rowise iteration and vectorize this?
import polars as pl
from polars_fuzzy_match import fuzzy_match_score
# Sample data
# wg_df is approximately a subset of eia_df.
wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})
eia_df = pl.DataFrame(
{"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
)
out = pl.DataFrame(
schema=[
("wg_id", pl.Int64),
("eia_id", pl.Int64),
("wg_utility_name", pl.String),
("utility_name", pl.String),
("score", pl.UInt32),
],
)
# Iterate through each wg utility and find the best match in eia
# can this be vectorized?
for wg_id, utility in wg_df.iter_rows():
res = (
eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
.filter(pl.col("score").is_not_null())
.sort(by="score", descending=True)
)
# insert the wg_id and wg_utility_name into the results. They have to be put into the
res.insert_column(
0,
pl.Series("wg_id", [wg_id] * len(res)),
)
res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
out = out.vstack(res.select([col_name for col_name in out.schema]))
polars-fuzzy-match would need to add support for vectorization. (i.e. at the Rust level)
The polars_ds plugin has vectorized functions backed by the impressive RapidFuzz library.
import polars_ds as pds
(eia_df
.lazy()
.join(wg_df.lazy(), how="cross")
.with_columns(
score = pds.str_fuzz(pl.col.utility_name, pl.col.utility_name_right),
)
.collect()
)
shape: (6, 5)
┌────────┬───────────────┬───────┬────────────────────┬──────────┐
│ eia_id ┆ utility_name ┆ wg_id ┆ utility_name_right ┆ score │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str ┆ f64 │
╞════════╪═══════════════╪═══════╪════════════════════╪══════════╡
│ 101 ┆ Utility A co. ┆ 1 ┆ Utility A ┆ 0.818182 │
│ 101 ┆ Utility A co. ┆ 2 ┆ Utility B ┆ 0.727273 │
│ 102 ┆ Utility B ┆ 1 ┆ Utility A ┆ 0.888889 │
│ 102 ┆ Utility B ┆ 2 ┆ Utility B ┆ 1.0 │
│ 103 ┆ utility c ┆ 1 ┆ Utility A ┆ 0.777778 │
│ 103 ┆ utility c ┆ 2 ┆ Utility B ┆ 0.777778 │
└────────┴───────────────┴───────┴────────────────────┴──────────┘
It seems polars-fuzzy-match
uses the nucleo library which says it uses the Smith-Waterman algorithm.
It seems like it only gives a scores if they are actual substrings?
(eia_df
.lazy()
.join(wg_df.lazy(), how="cross")
.with_columns(
score = pds.str_fuzz(pl.col.utility_name, pl.col.utility_name_right),
)
.with_columns(
a =
pl.col.utility_name
.str.to_lowercase()
.str.contains(
pl.col.utility_name_right.str.to_lowercase(),
literal = True
),
b =
pl.col.utility_name_right.str.to_lowercase()
.str.contains(
pl.col.utility_name.str.to_lowercase(),
literal = True
)
)
.filter(pl.col.a | pl.col.b)
.collect()
)
shape: (2, 7)
┌────────┬───────────────┬───────┬────────────────────┬──────────┬──────┬───────┐
│ eia_id ┆ utility_name ┆ wg_id ┆ utility_name_right ┆ score ┆ a ┆ b │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str ┆ f64 ┆ bool ┆ bool │
╞════════╪═══════════════╪═══════╪════════════════════╪══════════╪══════╪═══════╡
│ 101 ┆ Utility A co. ┆ 1 ┆ Utility A ┆ 0.818182 ┆ true ┆ false │
│ 102 ┆ Utility B ┆ 2 ┆ Utility B ┆ 1.0 ┆ true ┆ true │
└────────┴───────────────┴───────┴────────────────────┴──────────┴──────┴───────┘