I have a simple Polars dataframe with some nulls and some NaNs, and I want to drop only the latter. I'm trying to use drop_nans()
by applying it to all columns, but for whatever reason it replaces the NaNs with a literal 1.0.
I'm confused. Maybe I'm using the method wrong, but the docs don't have much info and definitely don't describe this behaviour:
import polars as pl

ex = pl.DataFrame(
    {
        'a': [float('nan'), 1, float('nan')],
        'b': [None, 'a', 'b']
    }
)
ex.with_columns(pl.all().drop_nans())
Out:
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ str  │
╞═════╪══════╡
│ 1.0 ┆ null │
│ 1.0 ┆ a    │
│ 1.0 ┆ b    │
└─────┴──────┘
I'm using the latest Polars, 1.5.
What is the correct way to drop NaNs across all columns, given that in Polars 1.5 dataframes don't seem to have a drop_nans() method, only Series do?
EDIT: I'm expecting the result to be:
a b
1.0 'a'
What happens in your example is that drop_nans works on a per-column basis. It first reduces the series [float('nan'), 1, float('nan')] to [1.0], and then broadcasts that single value across the entire column when it is combined with the length-3 column [None, 'a', 'b'].
It does this because Polars doesn't yet have a concept of a scalar value, so any column of length 1 is treated as a scalar when deciding whether to broadcast. This will change in the future, but it is a lot of work. For now you can see incorrect broadcasts like this when using expressions that filter columns, such as drop_nans, whenever they happen to filter a column down to length 1.
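You can see the per-column reduction in isolation (a minimal sketch, reusing the example frame ex from the question):
>>> ex.select(pl.col('a').drop_nans())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
└─────┘
Inside with_columns, that length-1 result is then treated as a scalar and broadcast back to the frame's original height of 3, which is where the repeated 1.0 comes from.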
Instead of doing pl.all().drop_nans(), you should filter to just those rows where column a is not NaN:
>>> ex.filter(pl.col.a.is_not_nan())
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ str │
╞═════╪═════╡
│ 1.0 ┆ a │
└─────┴─────┘
Or, more generally, if you have multiple columns with floating-point values:
>>> import polars.selectors as cs
>>> ex.filter(pl.all_horizontal(cs.float().is_not_nan()))
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ f64 ┆ str │
╞═════╪═════╡
│ 1.0 ┆ a │
└─────┴─────┘
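If you would rather treat NaNs like missing values across the whole frame, one alternative (a sketch, not the only way) is to convert NaNs to nulls with fill_nan and then use drop_nulls, which does exist on dataframes:
>>> # Turn NaNs in float columns into nulls, then drop rows where "a" is null.
>>> # (drop_nulls() with no subset would also drop the row where "b" is null.)
>>> ex.with_columns(cs.float().fill_nan(None)).drop_nulls(subset='a')
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ str │
╞═════╪═════╡
│ 1.0 ┆ a   │
└─────┴─────┘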