I am trying to remove null values across a list of selected columns. But it seems that I might have got the with_columns
operation not right. What's the right approach if you want to operate the removing only on selected columns?
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2000": [1, None, 3],
"eat2001": [-2, None, 4],
"eat2002": [None, None, None],
"eat2003": [-9, None, 8],
"eat2004": [None, None, 8]
}
); df
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
│ TK ┆ null ┆ null ┆ null ┆ null ┆ null │
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
col_list = [word for word in df.columns if word.startswith(("eat"))]
(
df
.with_columns(
pl.col(col_list).filter(~pl.fold(True, lambda acc, s: acc & s.is_null(), pl.all()))
)
)
# InvalidOperationError: dtype String not supported in 'not' operation
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
polars.col
also accepts Regular expressions which is one way to select all columns that start with a specific string.polars.all_horizontal
combines all results horizontally (i.e., row-wise) to give a single True/False value per row.df.select(
~pl.all_horizontal(pl.col(r'^eat.*$').is_null())
)
shape: (3, 1)
┌───────┐
│ all │
│ --- │
│ bool │
╞═══════╡
│ true │
│ false │
│ true │
└───────┘
DataFrame.filter
can be used to keep only the true
rows:
df.filter(
~pl.all_horizontal(pl.col(r'^eat.*$').is_null())
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f32 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The ~
in front of the pl.all_horizontal
stands for negation. Notice that we didn't need the col_list
.
One caution: the regex expression in the pl.col
must start with ^
and end with $
. These cannot be omitted, even if the resulting regex expression is otherwise valid.
Alternately, if you don't like the ~
operator, there is .not_()
df.filter(
pl.all_horizontal(pl.col(r'^eat.*$').is_null()).not_()
)
Or, we can check if there are any non-null values instead:
df.filter(
pl.any_horizontal(pl.col(r'^eat.*$').is_not_null())
)
As an aside, Polars has other dedicated horizontal functions e.g. min_horizontal
, max_horizontal
, sum_horizontal
FYI, here's how to use the fold
method, if that is what you'd prefer. Note the use of pl.col
with a regex expression.
df.filter(
~pl.fold(True, lambda acc, s: acc & s.is_null(), exprs=pl.col(r'^eat.*$'))
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ null ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘