I have two data frames,
df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/abcd']})
conditions_df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/aaa'], 'category': [['a'], ['b'], ['c'], ['d']]})
now I want a df, for assigning categories to the first df based on first match for the url starts with in second df, that is the output should be,
url | category |
---|---|
https//abc.com | ['a'] |
https//abcd.com | ['b'] |
https//abcd.com/aaa | ['b'] - this one starts with https//abcd.com, that is the first match |
https//abc.com/abcd | ['a'] - this one starts with https//abc.com, that is the first match |
current code which works is like this,
def add_category_column(df: pl.DataFrame, conditions_df) -> pl.DataFrame:
# Initialize the category column with empty lists
df = df.with_columns(pl.Series("category", [[] for _ in range(len(df))], dtype=pl.List(pl.String)))
# Apply the conditions to populate the category column
for row in conditions_df.iter_rows():
url_start, category = row
df = df.with_columns(
pl.when(
(pl.col("url").str.starts_with(url_start)) & (pl.col("category").list.len() == 0)
)
.then(pl.lit(category))
.otherwise(pl.col("category"))
.alias("category")
)
return df
but is there a way to achieve the same without using for loops, could we use join_where here, but in my attempts join_where does not work for starts_with
I would expect pl.DataFrame.join_where()
to work, but apparently it doesn't allow pl.Expr.str.starts_with()
condition yet - I get only 1 binary comparison allowed as join condition
error.
So you can use pl.DataFrame.join()
and pl.DataFrame.filter()
instead:
(
df
.join(conditions_df, how="cross")
.filter(pl.col("url").str.starts_with(pl.col("url_right")))
.sort("url")
.group_by("url", maintain_order=True)
.agg(pl.col.category.first())
)
shape: (4, 2)
┌─────────────────────┬───────────┐
│ url ┆ category │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════════════╪═══════════╡
│ https//abc.com ┆ ["a"] │
│ https//abc.com/abcd ┆ ["a"] │
│ https//abcd.com ┆ ["b"] │
│ https//abcd.com/aaa ┆ ["b"] │
└─────────────────────┴───────────┘
You can also use DuckDB integration with Polars and use lateral join
:
import duckdb
duckdb.sql("""
select
d.url,
c.category
from df as d,
lateral (
select c.category
from conditions_df as c
where
starts_with(d.url, c.url)
limit 1
) as c
""")
┌─────────────────────┬───────────┐
│ url │ category │
│ varchar │ varchar[] │
├─────────────────────┼───────────┤
│ https//abc.com │ [a] │
│ https//abc.com/abcd │ [a] │
│ https//abcd.com/aaa │ [b] │
│ https//abcd.com │ [b] │
└─────────────────────┴───────────┘
however, you have to be careful cause in standard SQL specification row collections are unordered, so I'd not do that in production without adding explicit order by
clause into lateral part.