I would like to set some cells to null based on a condition. For example:
import pandas as pd  # version is 2.2.2
df = pd.DataFrame({'x': [1, 2, 2, 1, 1, 2]})
df["b"] = False
df.loc[df["x"] == 1, "b"] = pd.NA
It works, but I get this warning:
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'nan' has dtype incompatible with bool, please explicitly cast to a compatible dtype first.
I tried reading the documentation and looking at examples, but could not find a solution. What is the correct way to do this?
By defining b with df['b'] = False, you set the Series/column's dtype to bool, and since pd.NA is not a bool, it cannot be inserted safely in the column, which raises the warning (this will be an error in the future).
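To see the mismatch for yourself, here is a minimal sketch that reuses the df from the question and only inspects the dtype; nothing in it goes beyond the setup already shown:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, 1, 1, 2]})
df["b"] = False

# Broadcasting a Python bool creates a NumPy-backed bool column.
print(df["b"].dtype)            # bool

# pd.NA is its own singleton type, not a bool, so it does not fit that dtype.
print(isinstance(pd.NA, bool))  # False
```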
You could initialize the column as object:
import numpy as np
df['b'] = np.array(False, dtype='object')
df.loc[df['x']==1, 'b'] = pd.NA
Then df['b'].dtype is dtype('O') (object).
Or, better, as nullable boolean:
df['b'] = pd.Series(False, index=df.index, dtype='boolean')
df.loc[df['x']==1, 'b'] = pd.NA
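If the column already exists as a plain bool column, the explicit cast that the warning asks for could look like the sketch below. It assumes the same df as in the question and uses astype('boolean') to switch to the nullable dtype before assigning pd.NA:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, 1, 1, 2]})
df["b"] = False                      # existing NumPy-backed bool column

# Cast to the nullable extension dtype first, then assign the missing values.
df["b"] = df["b"].astype("boolean")
df.loc[df["x"] == 1, "b"] = pd.NA

print(df["b"].dtype)                 # boolean
print(df["b"].isna().tolist())       # [True, False, False, True, True, False]
```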
Note that you could also first initialize a nullable boolean column of <NA>s, then assign False where df['x'] != 1:
df['b'] = pd.Series(dtype='boolean')
df.loc[df['x']!=1, 'b'] = False
Now df['b'].dtype is BooleanDtype (nullable boolean).
Output:
   x      b
0  1   <NA>
1  2  False
2  2  False
3  1   <NA>
4  1   <NA>
5  2  False
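As a variation that is not one of the approaches above, the column can probably also be built in a single step with Series.mask, which replaces values where the condition is True. This is only a sketch: it relies on mask filling with the dtype's own missing value (pd.NA for the nullable boolean dtype) when no replacement is given.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, 1, 1, 2]})

# Start from an all-False nullable boolean Series, then blank out rows where x == 1.
df["b"] = pd.Series(False, index=df.index, dtype="boolean").mask(df["x"] == 1)

print(df["b"].dtype)            # boolean
print(df["b"].isna().tolist())  # [True, False, False, True, True, False]
```

Because the column is nullable boolean from the start, no incompatible value is ever written into a bool column, so the FutureWarning should not appear.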