I need all the rows that have null in one of the predefined columns. I basically need this but i have one more requirement that I cant seem to figure out. Not every column needs to be checked.
I have a function that returns the names of the columns that need to be checked in a list.
Assume this is my dataframe:
import polars as pl
df = pl.from_repr("""
┌───────┬───────┬─────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═══════╪═══════╪═════╪═══════╡
│ abc ┆ null ┆ u ┆ true │
│ def ┆ abc ┆ v ┆ true │
│ ghi ┆ def ┆ null┆ true │
│ jkl ┆ uvw ┆ x ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└───────┴───────┴─────┴───────┘
""")
Doing this gives me all rows where any of the columns contain null.
df.filter(pl.any_horizontal(pl.all().is_null()))
shape: (4, 4)
┌─────┬──────┬──────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪══════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ ghi ┆ def ┆ null ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴──────┴──────┘
Sometimes it's fine for column c to contain a null so let's not check it.
What I want is this:
┌───────┬───────┬─────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═══════╪═══════╪═════╪═══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└───────┴───────┴─────┴───────┘
Row 3 is not shown even though there is a null value in column c.
columns = ["a", "b", "d"]
df.filter(pl.any_horizontal(pl.all(*columns).is_null()))
This gives me
SchemaError: invalid series dtype: expected
Boolean, gotstrfor series with namea
How do I get the full rows of data that contain a null in one of ["a", "b", "d"] columns?
If you want exclude some columns you can use .exclude():
import polars as pl
data.filter(pl.any_horizontal(pl.exclude("c").is_null()))
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘
Or you can just use column names by using .col():
import polars as pl
cols = ["a","b","d"]
data.filter(pl.any_horizontal(pl.col(cols).is_null()))
shape: (3, 4)
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘
If you want to be really flexible, you can use selectors, for example .selectors.exclude():
import polars.selectors as cs
data.filter(pl.any_horizontal(cs.exclude("c").is_null()))
shape: (3, 4)
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘