Tags: python, pandas, numpy, dataframe, dtype

Preserving dtypes when extracting a row from a pandas DataFrame


Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when dealing with heterogeneous data in the DataFrame (i.e. the DataFrame’s columns are not all the same dtype), this causes all the values from the different columns in the row to be coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:

import numpy
import pandas

a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a      int64
# b     uint64
# c    float64
# dtype: object
df
#    a   b     c
# 0  0   0   0.0
# 1  1   1   1.0
# 2  2   4   8.0
# 3  3   9  27.0
# 4  4  16  64.0
df.loc[2]
# a    2.0
# b    4.0
# c    8.0
# Name: 2, dtype: float64

All values in df.loc[2] have been converted to float64.

Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don’t see a hassle-free way of creating such an array.


Solution

  • Another approach (but it feels slightly hacky):

    Instead of indexing loc or iloc with a single integer, you can use a slice of length 1. This returns a one-row DataFrame, which keeps each column's dtype, so you can read the values from it with iloc[0]. For example:

    In[1] : row2 = df[2:2+1]
    In[2] : type(row2)
    Out[2]: pandas.core.frame.DataFrame
    In[3] : row2.dtypes
    Out[3]: 
    a      int64
    b     uint64
    c    float64
    dtype: object
    In[4] : a2 = row2.a.iloc[0]
    In[5] : type(a2)
    Out[5]: numpy.int64
    In[6] : c2 = row2.c.iloc[0]
    In[7] : type(c2)
    Out[7]: numpy.float64
    

    To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).

    I think it would be better if pandas had a DataFrameRow type for this situation.
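
Along the same lines, here is a minimal sketch of two variations, assuming the same example DataFrame as in the question. Passing a one-element list to iloc should also give a one-row DataFrame, and to_records can then turn that row into a NumPy record whose fields keep their own dtypes, which is close to the structured array the question asks about. The names row_df, rec and scalars are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    "a": np.arange(5, dtype="i8"),
    "b": np.arange(5, dtype="u8") ** 2,
    "c": np.arange(5, dtype="f8") ** 3,
})

# A one-element list (instead of a bare integer) returns a one-row
# DataFrame, so every column keeps its own dtype.
row_df = df.iloc[[2]]
print(row_df.dtypes)

# Convert the one-row DataFrame to a record array and take its only
# element; each field of the record keeps the dtype of its column.
rec = row_df.to_records(index=False)[0]
print(rec.dtype)

# Alternatively, collect the scalars column by column into a plain dict.
scalars = {col: row_df[col].iloc[0] for col in row_df.columns}
print({col: type(val).__name__ for col, val in scalars.items()})
```

Both variations start from the one-row DataFrame, so the row is looked up only once. Whether the record or the dict is more convenient probably depends on how the values are used afterwards.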