pythonpython-polars

join_where with starts_with in polars


I have two data frames,

df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/abcd']})

conditions_df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/aaa'], 'category': [['a'], ['b'], ['c'], ['d']]})

now I want a df, for assigning categories to the first df based on first match for the url starts with in second df, that is the output should be,

url category
https//abc.com ['a']
https//abcd.com ['b']
https//abcd.com/aaa ['b'] - this one starts with https//abcd.com, that is the first match
https//abc.com/abcd ['a'] - this one starts with https//abc.com, that is the first match

current code which works is like this,

def add_category_column(df: pl.DataFrame, conditions_df) -> pl.DataFrame:
    
    # Initialize the category column with empty lists
    df = df.with_columns(pl.Series("category", [[] for _ in range(len(df))], dtype=pl.List(pl.String)))
    
    # Apply the conditions to populate the category column
    for row in conditions_df.iter_rows():
        url_start, category = row
        df = df.with_columns(
            pl.when(
                (pl.col("url").str.starts_with(url_start)) & (pl.col("category").list.len() == 0)
            )
            .then(pl.lit(category))
            .otherwise(pl.col("category"))
            .alias("category")
        )
    
    return df

but is there a way to achieve the same without using for loops, could we use join_where here, but in my attempts join_where does not work for starts_with


Solution

  • I would expect pl.DataFrame.join_where() to work, but apparently it doesn't allow pl.Expr.str.starts_with() condition yet - I get only 1 binary comparison allowed as join condition error.

    So you can use pl.DataFrame.join() and pl.DataFrame.filter() instead:

    (
        df
        .join(conditions_df, how="cross")
        .filter(pl.col("url").str.starts_with(pl.col("url_right")))
        .sort("url")
        .group_by("url", maintain_order=True)
        .agg(pl.col.category.first())
    )
    
    shape: (4, 2)
    ┌─────────────────────┬───────────┐
    │ url                 ┆ category  │
    │ ---                 ┆ ---       │
    │ str                 ┆ list[str] │
    ╞═════════════════════╪═══════════╡
    │ https//abc.com      ┆ ["a"]     │
    │ https//abc.com/abcd ┆ ["a"]     │
    │ https//abcd.com     ┆ ["b"]     │
    │ https//abcd.com/aaa ┆ ["b"]     │
    └─────────────────────┴───────────┘
    

    You can also use DuckDB integration with Polars and use lateral join:

    import duckdb
    
    duckdb.sql("""
        select
            d.url,
            c.category
        from df as d,
            lateral (
                select c.category
                from conditions_df as c
                where
                    starts_with(d.url, c.url)
                limit 1
            ) as c
    """)
    
    ┌─────────────────────┬───────────┐
    │         url         │ category  │
    │       varchar       │ varchar[] │
    ├─────────────────────┼───────────┤
    │ https//abc.com      │ [a]       │
    │ https//abc.com/abcd │ [a]       │
    │ https//abcd.com/aaa │ [b]       │
    │ https//abcd.com     │ [b]       │
    └─────────────────────┴───────────┘
    

    however, you have to be careful cause in standard SQL specification row collections are unordered, so I'd not do that in production without adding explicit order by clause into lateral part.