I am confused about different results of boolean indexing when using ~ after != versus when using just ==
I have a pandas df with 4 columns:
dic = {
"a": [1,1,1,0,0,1,1],
"b": [0,0,1,1,0,0,0],
"c": [1,0,1,0,0,1,0],
"d": [0,0,1,0,0,1,0],
}
df = pd.DataFrame(data=dic)
print(df)
a b c d
0 1 0 1 0
1 1 0 0 0
2 1 1 1 1
3 0 1 0 0
4 0 0 0 0
5 1 0 1 1
6 1 0 0 0
I want to subset the whole df dataframe: I want to remove all rows which have all elements zero, but just on the columns b c d, and not on a.
If I use ~ (not) operator after == I get the desired result:
names = ["b","c","d"]
df_A = df.loc[~(df[names] == 0.0).all(axis=1)]
print(df_A)
a b c d
0 1 0 1 0
2 1 1 1 1
3 0 1 0 0
5 1 0 1 1
But when I use just == I get different result:
names = ["b","c","d"]
df_B = df.loc[(df[names] != 0.0).all(axis=1)]
print(df_B)
a b c d
2 1 1 1 1
Do you have any idea why is this the case? Should these two not be the same?
Thank you.
You're not correctly following De Morgan's law.
If you use the opposite condition as input, you have to replace all
(AND) by any
(OR):
df_B = df.loc[(df[names] != 0.0).any(axis=1)]
In English this would be:
~
) rows for which ALL values are 0 (== 0
)!= 0
)In short the rules for inverting input conditions are:
>=
becomes <
; >
becomes <=
)~
)Output:
a b c d
0 1 0 1 0
2 1 1 1 1
3 0 1 0 0
5 1 0 1 1