How can I avoid None
matching when comparing values in two different polars dataframes? The equality treatment of None
seems inconsistent when using is_in
vs join
.
For instance, if I start with the dataframes main_pl (with all records) and sub_pl (with a subset of records):
import polars as pl
r1 = {"foo":"a", "bar":"b", "baz":"c"}
r2 = {"foo":"x", "bar":"y", "baz":"z"}
r3 = {"foo":"a", "bar":"b", "baz":None}
r4 = {"foo":"m", "bar":"n", "baz":"o"}
r5 = {"foo":"x", "bar":"y", "baz":None}
r6 = {"foo":"a", "bar":"b", "baz":None}
all_records = [r1, r2, r3, r4, r5, r6]
sub_records = [r1, r2, r3]
target_cols = ["foo", "bar", "baz"]
main_pl = pl.DataFrame(all_records)
sub_pl = pl.DataFrame(sub_records)
main_pl:
shape: (6, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪══════╡
│ a ┆ b ┆ c │
│ x ┆ y ┆ z │
│ a ┆ b ┆ null │
│ m ┆ n ┆ o │
│ x ┆ y ┆ null │
│ a ┆ b ┆ null │
└─────┴─────┴──────┘
and rows for_matching from sub_pl:
shape: (3, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪══════╡
│ a ┆ b ┆ c │
│ x ┆ y ┆ z │
│ a ┆ b ┆ null │
└─────┴─────┴──────┘
for_matching = sub_pl.select(
pl.struct(pl.col(target_cols)).alias("for_matching")
).get_column("for_matching")
for_matching
shape: (3,)
Series: 'for_matching' [struct[3]]
[
{"a","b","c"}
{"x","y","z"}
{"a","b",null}
]
1. When comparing None
values, is the difference in the default behavior between is_in
and join
inconsistent, or is there a reason or way to control what seems like different results?
--- is_in
main_pl.with_columns(
pl.struct(pl.all()).is_in(for_matching.implode()).alias("matched"))
shape: (6, 4)
┌─────┬─────┬──────┬─────────┐
│ foo ┆ bar ┆ baz ┆ matched │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪═════╪══════╪═════════╡
│ a ┆ b ┆ c ┆ true │
│ x ┆ y ┆ z ┆ true │
│ a ┆ b ┆ null ┆ true │
│ m ┆ n ┆ o ┆ false │
│ x ┆ y ┆ null ┆ false │
│ a ┆ b ┆ null ┆ true │
└─────┴─────┴──────┴─────────┘
--- join
main_pl.join(sub_pl, on=target_cols)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ a ┆ b ┆ c │
│ x ┆ y ┆ z │
└─────┴─────┴─────┘
2. Is there a way to make polars is_in
operate similar to pandas isin
methods when comparing None
values?
import pandas as pd
main_pd = pd.DataFrame(all_records)
sub_pd = pd.DataFrame(sub_records)
main_pd["matched_w_none"] = main_pd[target_cols].isin(sub_pd).all(1)
pl.from_pandas(main_pd)
shape: (6, 4)
┌─────┬─────┬──────┬────────────────┐
│ foo ┆ bar ┆ baz ┆ matched_w_none │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪═════╪══════╪════════════════╡
│ a ┆ b ┆ c ┆ true │
│ x ┆ y ┆ z ┆ true │
│ a ┆ b ┆ null ┆ false │
│ m ┆ n ┆ o ┆ false │
│ x ┆ y ┆ null ┆ false │
│ a ┆ b ┆ null ┆ false │
└─────┴─────┴──────┴────────────────┘
Ultimately I'm trying to achieve the same results through polars that I get by default with pandas when comparing rows from different dataframes where None
values might be present.
I'm not sure if your .join()
example is actually equivalent if you don't use a struct:
main_pl.join(sub_pl, on=pl.struct(target_cols))
shape: (4, 6)
┌─────┬─────┬──────┬───────────┬───────────┬───────────┐
│ foo ┆ bar ┆ baz ┆ foo_right ┆ bar_right ┆ baz_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════╪══════╪═══════════╪═══════════╪═══════════╡
│ a ┆ b ┆ c ┆ a ┆ b ┆ c │
│ x ┆ y ┆ z ┆ x ┆ y ┆ z │
│ a ┆ b ┆ null ┆ a ┆ b ┆ null │
│ a ┆ b ┆ null ┆ a ┆ b ┆ null │
└─────┴─────┴──────┴───────────┴───────────┴───────────┘
is_in
did change recently with regards to null propagation:
import polars as pl
df = pl.select(x = None, y = [None])
df = df.cast({"x": pl.String, "y": pl.List(pl.String)})
print(
df.with_columns(z = pl.col.x.is_in("y"))
)
uv run --isolated --with polars==1.27.1 a.py
shape: (1, 3)
┌──────┬───────────┬──────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ bool │
╞══════╪═══════════╪══════╡
│ null ┆ [null] ┆ true │
└──────┴───────────┴──────┘
uv run --isolated --with polars==1.28.0 a.py
shape: (1, 3)
┌──────┬───────────┬──────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ bool │
╞══════╪═══════════╪══════╡
│ null ┆ [null] ┆ null │
└──────┴───────────┴──────┘
However, I cannot find any discussions about the LHS being a nested type (Struct, List, Array) that contains nulls and how that should behave.
import polars as pl
df = pl.select(x = [None], y = [[None]])
df = df.cast({"x": pl.List(pl.String), "y": pl.List(pl.List(pl.String))})
print(
df.with_columns(z = pl.col.x.is_in("y"))
)
shape: (1, 3)
┌───────────┬─────────────────┬──────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ list[list[str]] ┆ bool │
╞═══════════╪═════════════════╪══════╡
│ [null] ┆ [[null]] ┆ true │
└───────────┴─────────────────┴──────┘
For your specific use case, you could add an extra is_not_null
check:
main_pl.with_columns(
pl.all_horizontal(
pl.all().is_not_null(),
pl.struct(pl.all()).is_in(for_matching.implode())
)
.alias("matched")
)
shape: (6, 4)
┌─────┬─────┬──────┬─────────┐
│ foo ┆ bar ┆ baz ┆ matched │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪═════╪══════╪═════════╡
│ a ┆ b ┆ c ┆ true │
│ x ┆ y ┆ z ┆ true │
│ a ┆ b ┆ null ┆ false │
│ m ┆ n ┆ o ┆ false │
│ x ┆ y ┆ null ┆ false │
│ a ┆ b ┆ null ┆ false │
└─────┴─────┴──────┴─────────┘