while I was scripting a column, I came into something very interesting. There are two ways in which I was using pd.DataFrame.isna for single and multiple columns. While I am scripting in multiple brackets pd.df.isna is returning the entire code back to me.
override[override.ORIGINAL_CREDITOR_ID.notna()].shape
override[override[['ORIGINAL_CREDITOR_ID']].notna()].shape
So the first line returns me 3880 rows and runs in 2.5ms whereas the second one returns me all the rows present in the override data frame and that too takes 3.08s. Is there a reason why that is happening? How can I avoid this because I have to make it configurable for passing multiple columns in the second query?
The first line of code is selection with a Boolean Series
, while the second is selection with a Boolean DataFrame
, and these are handled very differently as DataFrames are 2D and there are 2 axes to align. There's a section dedicated to illustrating this difference in the pandas docs.
In the first case, selection with a Boolean Series, you return all columns for only the rows that are True
in the Boolean Series.
In the case of selection with a Boolean DataFrame, you return an object the same shape as the original where the True
values in the Boolean DataFrame are kept and any False
values are replaced with NaN
. (It's actually implemented as DataFrame.where
) For rows and columns that don't appear in your Boolean DataFrame mask, those become NaN
by default.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.NaN, 4],
'b': [10, 11, 12, 13]})
# Boolean Series, return all columns only for for rows where condition is True
df[df['a'] == 2]
# a b
#1 2.0 11
# Boolean DataFrame, equivalent to df.where(df[['a']] == 2)
df[df[['a']] == 2]
# a b
#0 NaN NaN
#1 2.0 NaN
#2 NaN NaN
#3 NaN NaN