I'm subclassing pandas DataFrame in a project of mine. Most pandas operations preserve the subclass type, but df.groupby().agg() does not. Is this a bug? Is there a known workaround?

import pandas as pd

class MySeries(pd.Series):
    pass

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    _constructor_sliced = MySeries

MySeries._constructor_expanddim = MyDataFrame

df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})

print(type(df.groupby("b").sum()))
# <class '__main__.MyDataFrame'>
print(type(df.groupby("b").agg({"a": "sum"})))
# <class 'pandas.core.frame.DataFrame'>

It looks like there was an issue (described here) that fixed subclassing for df.groupby, but as far as I can tell df.groupby().agg() was missed. I'm using pandas version 2.0.3.
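
It turns out that groupby().agg() combines Series to build a DataFrame, so the Series subclass needs its own constructors defined as well: _constructor must return the Series subclass, and _constructor_expanddim must return the DataFrame subclass. See this documentation.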
The following code runs with no errors:

import pandas as pd

class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        return MyDataFrame

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        return MySeries

df = MyDataFrame({"a": reversed(range(10)), "b": list('aaaabbbccc')})
assert isinstance(df.groupby("b").agg({"a": "sum"}), MyDataFrame)
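
To see the two hooks in isolation, here is a minimal sketch (not taken from the pandas source) that reuses the class definitions above and follows one column down to a Series and back up to a DataFrame. It assumes, as the answer suggests, that agg() rebuilds the frame from Series along a similar path; the variable names `column` and `rebuilt` are only illustrative.

```python
import pandas as pd

class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series has to grow into a DataFrame (e.g. to_frame()).
        return MyDataFrame

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

    @property
    def _constructor_sliced(self):
        # Used when a DataFrame is sliced down to a single column.
        return MySeries

df = MyDataFrame({"a": list(range(9, -1, -1)), "b": list("aaaabbbccc")})

# DataFrame -> Series goes through _constructor_sliced.
column = df["a"]
print(type(column).__name__)

# Series -> DataFrame goes through _constructor_expanddim.
rebuilt = column.to_frame()
print(type(rebuilt).__name__)
```

If either hook is missing on the Series side, the round trip can fall back to a plain pandas type, which matches the symptom in the question.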