python numpy pandas dataframe inclusion

Membership testing of floats in Pandas int64 dataframe produces unexpected result


I have a Pandas dataframe, named "impression_data," which includes a column called "site.id," like this:

>>> impression_data['site.id']
0      62
1     189
2     191
3      62
...

Each item in this column has the datatype numpy.int64, like this:

>>> for i in impression_data['site.id']:
    print type(i)

<type 'numpy.int64'>
<type 'numpy.int64'>
<type 'numpy.int64'>
...

And as expected, membership testing works well so long as I test integers:

>>> 62 in impression_data['site.id']
True

But here's the unexpected result: I was under the impression that a column of np.int64's ought not to include any decimal values whatsoever. Apparently I'm wrong. What's going on here?

>>> 62.5 in impression_data['site.id']
True

Edit 1: All values in the column ought to be integers by construction. For completeness, I have also performed the following casting operation and encountered no errors:

impression_data['site.id'] = impression_data['site.id'].astype('int')

As per @BremBam's suggestions in the comments, I tried

impression_data['site.id'].map(type).unique()

which produces

[<type 'numpy.int64'>]

The real data file I'm working with is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/cm_impression.csv

and a minimal example is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/ExampleCode.py


Solution

  • This is a bug in pandas. The value is cast to the type of the index before the containment test is done, so 62.5 is converted to 62. (Note that the in operator on a Series checks whether the value is in the index, not in the values.)

    I believe you can get what you want by doing 62.5 in impression_data['site.id'].values, which tests the column's values rather than its index.
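
    To make the index-versus-values distinction concrete, here is a minimal sketch. It assumes a recent Python 3 and pandas; the four-row site_id Series is a hypothetical stand-in for impression_data['site.id'], not the real data. Newer pandas versions may no longer reproduce the 62.5 cast described above, so the sketch only shows the part that should hold regardless: in on a Series looks at index labels, while in on .values looks at the stored numbers.

```python
import pandas as pd

# Hypothetical stand-in for impression_data['site.id'].
# With no explicit index, the row labels are 0, 1, 2, 3.
site_id = pd.Series([62, 189, 191, 62], name="site.id")

# `in` on a Series tests the index labels, not the stored values.
label_hit = 0 in site_id       # 0 is a row label
label_miss = 189 in site_id    # 189 is a value, but not a row label

# `in` on the underlying NumPy array tests the values.
value_hit = 62 in site_id.values
float_hit = 62.5 in site_id.values

# An elementwise comparison is an explicit alternative for values.
float_hit_eq = bool((site_id == 62.5).any())

print(label_hit, label_miss, value_hit, float_hit, float_hit_eq)
```

    If that holds, an integer test such as 62 in impression_data['site.id'] succeeds only when 62 happens to be one of the row labels, independent of whether 62 appears in the column.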