pythonpandas

pandas queries for non-empty strings giving different results


Can anyone help explain why I am getting different results when using 2 query methods that I thought were equivalent? I have read the docs many times, but I still can't figure it out. I am sure I must be missing or misunderstanding something fundamental.

This is my test data:

import pandas as pd

df = pd.DataFrame(
    {
        "fruit": ["apple", "banana", "cherry", "", pd.NA]
    },
    dtype = "string"
)

And these are the different query results:

>>> df.query("fruit != ''")

    fruit
0   apple
1  banana
2  cherry
4    <NA>

>>> df[df.fruit != '']

    fruit
0   apple
1  banana
2  cherry

Why does the first query return the row containing <NA> and the other query doesn't?


Solution

  • After a deeper investigation, I concluded that there is a clear implementation difference between the query method and indexing.

    I looked at the code, and pandas.query uses the eval method. This method can use two different parsers: pandas and python.

    The pandas parser is the default and brings the result you got including <NA>.

    Unlikely, (I don't know why) the python parser does not implement == and !=; but you can use df.query("fruit > ''", parser="python") and see that the result is what you expect.