Extracting a single row from a pandas DataFrame (e.g. using .loc or .iloc) yields a pandas Series. However, when the DataFrame holds heterogeneous data (i.e. its columns are not all the same dtype), all the values from the row's different columns are coerced into a single dtype, because a Series can only have one dtype. Here is a simple example to show what I mean:
import numpy
import pandas
a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})
df.dtypes
# a int64
# b uint64
# c float64
# dtype: object
df
# a b c
# 0 0 0 0.0
# 1 1 1 1.0
# 2 2 4 8.0
# 3 3 9 27.0
# 4 4 16 64.0
df.loc[2]
# a 2.0
# b 4.0
# c 8.0
# Name: 2, dtype: float64
All values in df.loc[2] have been converted to float64.
Is there a good way to extract a row without incurring this type conversion? I could imagine e.g. returning a numpy structured array, but I don't see a hassle-free way of creating such an array.
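One possibility along these lines, sketched here under the assumption that a record array is an acceptable stand-in for a structured array: DataFrame.to_records converts the whole frame into a numpy record array whose fields keep the per-column dtypes, so a single record preserves each value's type:

```python
import numpy
import pandas

a = numpy.arange(5, dtype='i8')
b = numpy.arange(5, dtype='u8')**2
c = numpy.arange(5, dtype='f8')**3
df = pandas.DataFrame({'a': a, 'b': b, 'c': c})

# Convert the frame to a numpy record array; each field keeps
# its original column dtype (i8, u8, f8 here).
rec = df.to_records(index=False)
row = rec[2]  # a single numpy record, no dtype coercion
print(row['a'], row['b'], row['c'])  # 2 4 8.0
```

The conversion copies the data, so this is only attractive if you need many rows from the same frame; for one-off access the slicing approach below is cheaper.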
Another approach (but it feels slightly hacky): instead of using an integer with loc or iloc, you can use a slice of length 1. This returns a DataFrame of length 1, so iloc[0] of any of its columns contains your data. e.g.:
In [1]: row2 = df[2:2+1]
In [2]: type(row2)
Out[2]: pandas.core.frame.DataFrame
In [3]: row2.dtypes
Out[3]:
a      int64
b     uint64
c    float64
dtype: object
In [4]: a2 = row2.a.iloc[0]
In [5]: type(a2)
Out[5]: numpy.int64
In [6]: c2 = row2.c.iloc[0]
In [7]: type(c2)
Out[7]: numpy.float64
To me this feels preferable to converting the data types twice (once during row extraction, and again afterwards), and clearer than referring to the original DataFrame multiple times with the same row specification (which could be computationally expensive).
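For what it's worth, the same trick works purely positionally with iloc (a sketch using row position 2 from the example above): a length-1 slice or a one-element list both return a one-row DataFrame with the per-column dtypes intact:

```python
import numpy
import pandas

df = pandas.DataFrame({
    'a': numpy.arange(5, dtype='i8'),
    'b': numpy.arange(5, dtype='u8')**2,
    'c': numpy.arange(5, dtype='f8')**3,
})

row2 = df.iloc[2:3]       # length-1 positional slice -> one-row DataFrame
row2_list = df.iloc[[2]]  # a one-element list does the same
# Per-column dtypes are preserved in both:
assert (row2.dtypes == df.dtypes).all()
a2 = row2['a'].iloc[0]    # scalar keeps its column's dtype
print(type(a2))           # <class 'numpy.int64'>
```

Using iloc avoids the label-vs-position ambiguity of plain df[2:3], which slices by position for integer slices but can surprise you with non-default indexes.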
I think it would be better if pandas had a DataFrameRow type for this situation.