Subclassing Pandas classes seems to be a common need, but I could not find references on the subject. (It seems that the Pandas developers are still working on it: Easier subclassing #60.)
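There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:
1. Instances of the subclass, when calling standard methods of DataFrame, should produce instances of the subclass.
2. Attributes attached to instances of the subclass, when calling standard methods of DataFrame, should still be attached to the output.
(And are there any significant differences for subclassing pandas.Series?)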
Code for subclassing pd.DataFrame:
import numpy as np
import pandas as pd
class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf)) # <class '__main__.MyDF'>
# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub)) # <class 'pandas.core.frame.DataFrame'>
# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr')) # False
print(hasattr(mydf_cp2, 'myattr')) # False
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide mentions this subclassed DataFrame from the Geopandas project as a good example.
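As in HYRY's answer, it seems there are two things you're trying to accomplish:
1. Have methods called on an instance of your class return instances of your type. For that, add the _constructor property, which should return your type.
2. Have attributes attached to your object carry over to copies. For that, list the names of these attributes in the _metadata attribute.
Here's an example: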
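from pandas import DataFrame

class SubclassedDataFrame(DataFrame):
    # Attribute names listed here are carried over to results of pandas operations
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        # pandas calls this to build results, so methods return this subclass
        return SubclassedDataFrame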
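To check this against the two requirements in the question, and to touch on the Series part, here is a minimal sketch. It assumes a reasonably recent pandas. The `SubclassedSeries` class, the sample data and the `'custom'` value are made up for illustration. The `_constructor_sliced` and `_constructor_expanddim` properties are not part of the example above; they are the hooks the pandas subclassing guide describes for moving between a DataFrame and its Series.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    @property
    def _constructor(self):
        # Series methods should return this subclass
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        # Series -> DataFrame operations (e.g. to_frame) should return our frame type
        return SubclassedDataFrame


class SubclassedDataFrame(pd.DataFrame):
    _metadata = ['added_property']
    added_property = 1

    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        # Selecting a single column should return our Series type
        return SubclassedSeries


# Illustrative data only
df = SubclassedDataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Requirement 1: standard methods return the subclass
sub = df[['A', 'C']]
print(isinstance(sub, SubclassedDataFrame))  # True

# Requirement 2: attributes named in _metadata survive a copy
df.added_property = 'custom'
cp = df.copy()
print(cp.added_property)  # custom

# Series side: a single column comes back as the Series subclass
col = df['A']
print(isinstance(col, SubclassedSeries))  # True
frame_again = col.to_frame()
```

Only names listed in `_metadata` are propagated; an attribute set ad hoc on an instance, such as `myattr` in the question, is not carried over.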