Can anyone help explain why I am getting different results when using 2 query methods that I thought were equivalent? I have read the docs many times, but I still can't figure it out. I am sure I must be missing or misunderstanding something fundamental.
This is my test data:
import pandas as pd
df = pd.DataFrame(
{
"fruit": ["apple", "banana", "cherry", "", pd.NA]
},
dtype = "string"
)
And these are the different query results:
>>> df.query("fruit != ''")
fruit
0 apple
1 banana
2 cherry
4 <NA>
>>> df[df.fruit != '']
fruit
0 apple
1 banana
2 cherry
Why does the first query return the row containing <NA>
and the other query doesn't?
After a deeper investigation, I concluded that there is a clear implementation difference between the query
method and indexing.
I looked at the code, and pandas.query uses the eval
method. This method can use two different parsers: pandas
and python
.
The pandas
parser is the default and brings the result you got including <NA>
.
Unlikely, (I don't know why) the python
parser does not implement ==
and !=
; but you can use df.query("fruit > ''", parser="python")
and see that the result is what you expect.