I've seen that if you index a dataframe with a boolean series that has the same length as the dataframe's rows, it filters the dataframe. However, if we pass a condition instead of a precomputed boolean series (like df['col']==value) and want to apply a boolean operation to that condition (like ~), it does not work, even though the condition's result is a boolean series. It only works if the condition is wrapped in parentheses. In other words, this works:
df[~(df['col']>value)]
and this does not:
df[~df['col']>value]
Notice that the only difference is the parentheses.
I thought the parentheses were doing something to the boolean series resulting from df['col']>value, like casting it into another kind of object that supports operations such as ~. But they are not: type(df['col']>value) and type((df['col']>value)) are the same, which is pandas.core.series.Series. So what are those parentheses doing that lets ~ work on the boolean series produced by the condition?
Moreover, if you have two boolean series derived from applying conditions to a dataframe, like
series_a=df['col']>value
series_b=df['col']==value
and you combine them with the & operator this way:
df[series_a & series_b]
it works fine. But computing them inline inside the indexing expression does not work:
df[df['col']>value & df['col']==value]
It gives the error TypeError: unsupported operand type(s) for &: 'int' and 'IntegerArray'
From that error I would assume operator precedence is at play, since it seems to be applying & to an IntegerArray, probably evaluating this:
df['col'] > (value & df['col']) == value
But I would like to ask to confirm.
Example:
Suppose we have a dataframe with a column tag that holds either the value A or B:
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'tag': [random.choice(['A', 'B']) for i in range(100)]})
If I try to filter doing this:
df[~(df['tag']=='A')]
It works, but if I do the same without those parentheses:
df[~df['tag']=='A']
it fails with the error TypeError: bad operand type for unary ~: 'str'
It's a question of operator precedence. When an expression contains two operators (~ and >), Python has to decide which one to apply first. In
~df['col']>value
~ has higher precedence, so it is applied first. You inverted the column and then compared the result. It's the same as (~(df['col'])) > value.
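One way to check this grouping without any data is to ask Python's own parser. The following is only a small sketch using the standard ast module; the helper name describe is made up for illustration, and df and value are never evaluated, only parsed.

```python
import ast

def describe(expr):
    """Return (outer node type, first operand type) for a parsed expression."""
    node = ast.parse(expr, mode="eval").body
    # A comparison keeps its first operand in .left, a unary operation in .operand
    inner = node.left if isinstance(node, ast.Compare) else node.operand
    return type(node).__name__, type(inner).__name__

without_parens = describe("~df['col'] > value")
with_parens = describe("~(df['col'] > value)")

print(without_parens)  # ('Compare', 'UnaryOp'): ~ is applied first, then >
print(with_parens)     # ('UnaryOp', 'Compare'): > is applied first, then ~
```

The outer node is the operation that runs last, so in the version without parentheses the comparison receives the already inverted column.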
If you want to compare and then negate, you have to use parentheses to override the default order of operations. A parenthesized expression is evaluated before the operators around it are applied. In
~(df['col']>value)
the comparison is done first.
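Here is a concrete sketch of the parenthesized form. The column values, value, and the second threshold low are made up for illustration. The same rule covers the & case from the question: & also binds tighter than the comparison operators, so each comparison needs its own parentheses.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"col": [1, 5, 10, 15]})
value = 5
low = 1  # hypothetical second threshold

# Compare first, then negate: keeps rows where col is NOT greater than value
not_greater = df[~(df["col"] > value)]

# Parenthesize each comparison so that & combines two boolean series
both = df[(df["col"] > low) & (df["col"] == value)]

print(not_greater["col"].tolist())  # [1, 5]
print(both["col"].tolist())         # [5]
```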