pythonleft-joinpython-polarspandas-explode

Left join of 2 exploded polars columns


Considering

import polars as pl

df = pl.DataFrame({"a": [
    [1, 2],
    [3]],
    "b": [
        [{"id": 1, "x": 1}, {"id": 3, "x": 3}],
        [{"id": 3, "x": 4}]]})

That looks like:

+------+---------------------+
|a     |b                    |
+------+---------------------+
|[1, 2]|[{1,1}, {3,3}]|
|[3]   |[{3,4}]              |
+------+---------------------+

How to


Current approach

.explode both a and b and .filter (INNER JOIN):

df.explode("a").explode("b").filter(
    pl.col("a") == pl.col("b").struct.field('id')
).select(
    pl.col("a"),
    pl.col("b").struct.field("x")
)

Unfortunately I get only the (expected):

+-+----+
|a|b   |
+-+----+
|1|1   |
|3|4   |
+-+----+

Instead of the full "LEFT JOIN" I am aiming to:

+-+----+
|a|b   |
+-+----+
|1|1   |
|2|null|
|3|4   |
+-+----+

How to efficiently get the desired result when the DataFrame is structured like that?


Solution

  • As you have mentioned it, you can also use a left-join explicitly:

    df_a = df.explode("a")
    df_b = df_a.explode("b").unnest("b").filter(pl.col("a") == pl.col("id"))
    
    df_a.select("a").join(df_b, on="a", how="left")
    
    shape: (3, 3)
    ┌─────┬──────┬──────┐
    │ a   ┆ id   ┆ x    │
    │ --- ┆ ---  ┆ ---  │
    │ i64 ┆ i64  ┆ i64  │
    ╞═════╪══════╪══════╡
    │ 1   ┆ 1    ┆ 1    │
    │ 2   ┆ null ┆ null │
    │ 3   ┆ 3    ┆ 4    │
    └─────┴──────┴──────┘
    

    With regards to your current filtering logic:

    You want to pick a single row if there are no matches per group e.g.

    is_id = pl.col("a") == pl.col("id")
    
    is_id | (is_id.not_().all().over("a") & pl.col("a").is_first_distinct())
    

    But it's quite awkward and you still have to null out the non-matches in an extra step.