I am working with Polars and need to drop columns that contain only null values during my data preprocessing. However, I am having trouble using the Lazy API to accomplish this.
For instance, given the table below, how can I drop column "a" using Polars' Lazy API?
df = pl.DataFrame(
    {
        "a": [None, None, None, None],
        "b": [1, 2, None, 1],
        "c": [1, None, None, 1],
    }
)
df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ null ┆ 1 ┆ 1 │
│ null ┆ 2 ┆ null │
│ null ┆ null ┆ null │
│ null ┆ 1 ┆ 1 │
└──────┴──────┴──────┘
I am aware of Issue #1613 and the solution of filtering out columns where all values are null, but that approach does not use the Lazy API.
FYI,
# filter columns where all values are null
df[:, [not (s.null_count() == df.height) for s in df]]
I am also aware of the drop_nulls function in Polars, but it can only drop rows that contain null values, unlike the dropna function in pandas, which accepts axis and how arguments.
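For clarity, here is a minimal sketch of the pandas behaviour I am describing (the data mirrors the frame above):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    {
        "a": [np.nan, np.nan, np.nan, np.nan],
        "b": [1, 2, np.nan, 1],
        "c": [1, np.nan, np.nan, 1],
    }
)

# axis=1 targets columns (not rows); how="all" drops a column
# only when every one of its values is null
pdf = pdf.dropna(axis=1, how="all")
```

After this call, column "a" is gone and "b" and "c" remain (including their partial nulls).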
Can someone provide an example of how to drop columns with all null values in Polars using the Lazy API?
You can't, at least not in a single pass. Polars doesn't know enough about the LazyFrame to tell which columns are entirely null until you collect. That means you need one collect to find out which columns you want, and a second one to materialize those columns.
First, make your frame lazy: df = df.lazy()
Step 1:
(
    df.select(pl.all().is_null().all())
    .unpivot()
    .filter(~pl.col("value"))
    .select("variable")
    .collect()
    .to_series()
    .to_list()
)
That gives you the columns that are not entirely null, so now you wrap that expression in its own select.
Step 2:
(
    df.select(
        df.select(pl.all().is_null().all())
        .unpivot()
        .filter(~pl.col("value"))
        .select("variable")
        .collect()
        .to_series()
        .to_list()
    )
    .collect()
)