What is the difference between pd.isna for columns in single bracket vs multiple bracket? It does not return na values in multiple brackets

while I was scripting a column, I came into something very interesting. There are two ways in which I was using pd.DataFrame.isna for single and multiple columns. While I am scripting in multiple brackets pd.df.isna is returning the entire code back to me.

override[override.ORIGINAL_CREDITOR_ID.notna()].shape

override[override[['ORIGINAL_CREDITOR_ID']].notna()].shape

So the first line returns me 3880 rows and runs in 2.5ms whereas the second one returns me all the rows present in the override data frame and that too takes 3.08s. Is there a reason why that is happening? How can I avoid this because I have to make it configurable for passing multiple columns in the second query?

Solution

The first line of code is selection with a Boolean Series, while the second is selection with a Boolean DataFrame, and these are handled very differently as DataFrames are 2D and there are 2 axes to align. There's a section dedicated to illustrating this difference in the pandas docs.

In the first case, selection with a Boolean Series, you return all columns for only the rows that are True in the Boolean Series.

In the case of selection with a Boolean DataFrame, you return an object the same shape as the original where the True values in the Boolean DataFrame are kept and any False values are replaced with NaN. (It's actually implemented as DataFrame.where) For rows and columns that don't appear in your Boolean DataFrame mask, those become NaN by default.

import pandas as pd
df = pd.DataFrame({'a': [1, 2, np.NaN, 4],
                   'b': [10, 11, 12, 13]})

# Boolean Series, return all columns only for for rows where condition is True
df[df['a'] == 2]
#     a   b
#1  2.0  11


# Boolean DataFrame, equivalent to df.where(df[['a']] == 2)
df[df[['a']] == 2]
#     a   b
#0  NaN NaN
#1  2.0 NaN
#2  NaN NaN
#3  NaN NaN