pythonpython-polars

Prevent `None` values from matching in Python-Polars


How can I avoid None matching when comparing values in two different polars dataframes? The equality treatment of None seems inconsistent when using is_in vs join.

For instance, if I start with the dataframes main_pl (with all records) and sub_pl (with a subset of records):

import polars as pl

r1 = {"foo":"a", "bar":"b", "baz":"c"}
r2 = {"foo":"x", "bar":"y", "baz":"z"}
r3 = {"foo":"a", "bar":"b", "baz":None}
r4 = {"foo":"m", "bar":"n", "baz":"o"}
r5 = {"foo":"x", "bar":"y", "baz":None}
r6 = {"foo":"a", "bar":"b", "baz":None}

all_records = [r1, r2, r3, r4, r5, r6]
sub_records = [r1, r2, r3]

target_cols = ["foo", "bar", "baz"]

main_pl = pl.DataFrame(all_records)
sub_pl = pl.DataFrame(sub_records)

main_pl:

shape: (6, 3)

┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz  │
│ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ a   ┆ b   ┆ c    │
│ x   ┆ y   ┆ z    │
│ a   ┆ b   ┆ null │
│ m   ┆ n   ┆ o    │
│ x   ┆ y   ┆ null │
│ a   ┆ b   ┆ null │
└─────┴─────┴──────┘

and rows for_matching from sub_pl:

shape: (3, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz  │
│ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ str  │
╞═════╪═════╪══════╡
│ a   ┆ b   ┆ c    │
│ x   ┆ y   ┆ z    │
│ a   ┆ b   ┆ null │
└─────┴─────┴──────┘
for_matching = sub_pl.select(
    pl.struct(pl.col(target_cols)).alias("for_matching")
).get_column("for_matching")
for_matching
shape: (3,)
Series: 'for_matching' [struct[3]]
[
    {"a","b","c"}
    {"x","y","z"}
    {"a","b",null}
]

1. When comparing None values, is the difference in the default behavior between is_in and join inconsistent, or is there a reason or way to control what seems like different results?

--- is_in

main_pl.with_columns(
 pl.struct(pl.all()).is_in(for_matching.implode()).alias("matched"))
shape: (6, 4)
┌─────┬─────┬──────┬─────────┐
│ foo ┆ bar ┆ baz  ┆ matched │
│ --- ┆ --- ┆ ---  ┆ ---     │
│ str ┆ str ┆ str  ┆ bool    │
╞═════╪═════╪══════╪═════════╡
│ a   ┆ b   ┆ c    ┆ true    │
│ x   ┆ y   ┆ z    ┆ true    │
│ a   ┆ b   ┆ null ┆ true    │
│ m   ┆ n   ┆ o    ┆ false   │
│ x   ┆ y   ┆ null ┆ false   │
│ a   ┆ b   ┆ null ┆ true    │
└─────┴─────┴──────┴─────────┘

--- join

main_pl.join(sub_pl, on=target_cols)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ a   ┆ b   ┆ c   │
│ x   ┆ y   ┆ z   │
└─────┴─────┴─────┘

2. Is there a way to make polars is_in operate similar to pandas isin methods when comparing None values?

import pandas as pd

main_pd = pd.DataFrame(all_records)
sub_pd = pd.DataFrame(sub_records)

main_pd["matched_w_none"] = main_pd[target_cols].isin(sub_pd).all(1)
pl.from_pandas(main_pd)
shape: (6, 4)
┌─────┬─────┬──────┬────────────────┐
│ foo ┆ bar ┆ baz  ┆ matched_w_none │
│ --- ┆ --- ┆ ---  ┆ ---            │
│ str ┆ str ┆ str  ┆ bool           │
╞═════╪═════╪══════╪════════════════╡
│ a   ┆ b   ┆ c    ┆ true           │
│ x   ┆ y   ┆ z    ┆ true           │
│ a   ┆ b   ┆ null ┆ false          │
│ m   ┆ n   ┆ o    ┆ false          │
│ x   ┆ y   ┆ null ┆ false          │
│ a   ┆ b   ┆ null ┆ false          │
└─────┴─────┴──────┴────────────────┘

Ultimately I'm trying to achieve the same results through polars that I get by default with pandas when comparing rows from different dataframes where None values might be present.


Solution

  • I'm not sure if your .join() example is actually equivalent if you don't use a struct:

    main_pl.join(sub_pl, on=pl.struct(target_cols))
    
    shape: (4, 6)
    ┌─────┬─────┬──────┬───────────┬───────────┬───────────┐
    │ foo ┆ bar ┆ baz  ┆ foo_right ┆ bar_right ┆ baz_right │
    │ --- ┆ --- ┆ ---  ┆ ---       ┆ ---       ┆ ---       │
    │ str ┆ str ┆ str  ┆ str       ┆ str       ┆ str       │
    ╞═════╪═════╪══════╪═══════════╪═══════════╪═══════════╡
    │ a   ┆ b   ┆ c    ┆ a         ┆ b         ┆ c         │
    │ x   ┆ y   ┆ z    ┆ x         ┆ y         ┆ z         │
    │ a   ┆ b   ┆ null ┆ a         ┆ b         ┆ null      │
    │ a   ┆ b   ┆ null ┆ a         ┆ b         ┆ null      │
    └─────┴─────┴──────┴───────────┴───────────┴───────────┘
    

    is_in did change recently with regards to null propagation:

    import polars as pl
    
    df = pl.select(x = None, y = [None])
    df = df.cast({"x": pl.String, "y": pl.List(pl.String)})
    
    print(
        df.with_columns(z = pl.col.x.is_in("y"))
    )
    
    uv run --isolated --with polars==1.27.1 a.py
    
    shape: (1, 3)
    ┌──────┬───────────┬──────┐
    │ x    ┆ y         ┆ z    │
    │ ---  ┆ ---       ┆ ---  │
    │ str  ┆ list[str] ┆ bool │
    ╞══════╪═══════════╪══════╡
    │ null ┆ [null]    ┆ true │
    └──────┴───────────┴──────┘
    
    uv run --isolated --with polars==1.28.0 a.py
    
    shape: (1, 3)
    ┌──────┬───────────┬──────┐
    │ x    ┆ y         ┆ z    │
    │ ---  ┆ ---       ┆ ---  │
    │ str  ┆ list[str] ┆ bool │
    ╞══════╪═══════════╪══════╡
    │ null ┆ [null]    ┆ null │
    └──────┴───────────┴──────┘
    

    However, I cannot find any discussions about the LHS being a nested type (Struct, List, Array) that contains nulls and how that should behave.

    import polars as pl
    
    df = pl.select(x = [None], y = [[None]])
    df = df.cast({"x": pl.List(pl.String), "y": pl.List(pl.List(pl.String))})
    
    print(
        df.with_columns(z = pl.col.x.is_in("y"))
    )
    
    shape: (1, 3)
    ┌───────────┬─────────────────┬──────┐
    │ x         ┆ y               ┆ z    │
    │ ---       ┆ ---             ┆ ---  │
    │ list[str] ┆ list[list[str]] ┆ bool │
    ╞═══════════╪═════════════════╪══════╡
    │ [null]    ┆ [[null]]        ┆ true │
    └───────────┴─────────────────┴──────┘
    

    For your specific use case, you could add an extra is_not_null check:

    main_pl.with_columns(
        pl.all_horizontal(
            pl.all().is_not_null(),
            pl.struct(pl.all()).is_in(for_matching.implode())
        )
        .alias("matched")
    )
    
    shape: (6, 4)
    ┌─────┬─────┬──────┬─────────┐
    │ foo ┆ bar ┆ baz  ┆ matched │
    │ --- ┆ --- ┆ ---  ┆ ---     │
    │ str ┆ str ┆ str  ┆ bool    │
    ╞═════╪═════╪══════╪═════════╡
    │ a   ┆ b   ┆ c    ┆ true    │
    │ x   ┆ y   ┆ z    ┆ true    │
    │ a   ┆ b   ┆ null ┆ false   │
    │ m   ┆ n   ┆ o    ┆ false   │
    │ x   ┆ y   ┆ null ┆ false   │
    │ a   ┆ b   ┆ null ┆ false   │
    └─────┴─────┴──────┴─────────┘