I want to create a class derived from pandas.DataFrame with a slightly different __init__(). I'll store some additional data in new attributes and finally call DataFrame.__init__().
from pandas import DataFrame

class DataFrameDerived(DataFrame):
    def __init__(self, *args, **kwargs):
        self.derived = True
        super().__init__(*args, **kwargs)

DataFrameDerived({'a':[1,2,3]})
This code gives the following error when creating the new attribute (self.derived = True):
RecursionError: maximum recursion depth exceeded while calling a Python object
It is possible, but the implementation isn't very open to extension. Indeed, the official docs suggest using alternatives. The implementation of pd.DataFrame is complex: it involves multiple inheritance with various mixins, and it uses attribute setting/getting hooks like __getattr__ and __setattr__ to, among other things, provide syntactic sugar so that df.some_column and df.some_column = whatever work without the df['some_column'] syntax. If you look at the stack trace, you can see that something is going on with __setattr__:
RecursionError                            Traceback (most recent call last)
Cell In[1], line 8
      5         self.derived = True
      6         super().__init__(*args, **kwargs)
----> 8 DataFrameDerived({'a':[1,2,3]})

Cell In[1], line 5, in DataFrameDerived.__init__(self, *args, **kwargs)
      4     def __init__(self, *args, **kwargs):
----> 5         self.derived = True
      6         super().__init__(*args, **kwargs)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:6014, in NDFrame.__setattr__(self, name, value)
   6012 else:
   6013     try:
-> 6014         existing = getattr(self, name)
   6015         if isinstance(existing, Index):
   6016             object.__setattr__(self, name, value)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:5986, in NDFrame.__getattr__(self, name)
   5976 """
   5977 After regular attribute access, try looking up the name
   5978 This allows simpler access to columns for interactive use.
   5979 """
   5980 # Note: obj.x will always call obj.__getattribute__('x') prior to
   5981 # calling obj.__getattr__('x').
   5982 if (
   5983     name not in self._internal_names_set
   5984     and name not in self._metadata
   5985     and name not in self._accessors
-> 5986     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987 ):
   5988     return self[name]
   5989 return object.__getattribute__(self, name)
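To see the general mechanism in isolation, here is a toy sketch that does not use pandas at all. The class names (Frameish, Broken) and the _columns attribute are made up for illustration, and the real pandas code is considerably more involved; the only point is that a __getattr__ hook which reads state that hasn't been initialized yet ends up calling itself.

```python
class Frameish:
    """Toy stand-in for a frame class (hypothetical, not pandas code)."""

    def __init__(self, columns):
        # Bypass our own __setattr__ hook when storing internal state.
        object.__setattr__(self, "_columns", dict(columns))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # If _columns itself does not exist yet, self._columns
        # re-enters __getattr__, which recurses without end.
        if name in self._columns:
            return self._columns[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        try:
            getattr(self, name)  # may fall through to __getattr__
        except AttributeError:
            pass
        object.__setattr__(self, name, value)


class Broken(Frameish):
    def __init__(self, columns):
        self.derived = True  # runs before _columns has been set
        super().__init__(columns)


def try_broken():
    """Return the name of the outcome instead of crashing."""
    try:
        Broken({"a": [1, 2, 3]})
    except RecursionError:
        return "RecursionError"
    return "ok"


print(try_broken())
```

In this toy version, assigning self.derived before the parent's __init__ has run is what triggers the recursion, which mirrors the order of operations in the question.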
Knowing this, one might blindly just use object.__setattr__ instead, to bypass this:
In [1]: from pandas import DataFrame
   ...:
   ...: class DataFrameDerived(DataFrame):
   ...:     def __init__(self, *args, **kwargs):
   ...:         object.__setattr__(self, 'derived', True)
   ...:         super().__init__(*args, **kwargs)
   ...:
   ...: DataFrameDerived({'a':[1,2,3]})
Out[1]:
   a
0  1
1  2
2  3
But again, without really understanding the implementation, you are just crossing your fingers and hoping "it works". Which it may. But as noted in the linked docs, you will probably also want to override the "constructor" properties (such as _constructor), so that your data frame type returns data frames of its own type when you use DataFrame methods.
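As a rough sketch of what the subclassing section of the docs describes, something like the following may be closer to what you want. It assumes a reasonably recent pandas; listing the attribute name in _metadata and setting it only after super().__init__() has run are my choices for this example, not the only way to do it.

```python
import pandas as pd


class DataFrameDerived(pd.DataFrame):
    # Names listed here are treated as ordinary attributes (not columns)
    # and are carried over to results of DataFrame operations.
    _metadata = ["derived"]

    @property
    def _constructor(self):
        # Make DataFrame methods return DataFrameDerived instead of DataFrame.
        return DataFrameDerived

    def __init__(self, *args, **kwargs):
        # Let pandas finish its own setup first, then add the attribute.
        super().__init__(*args, **kwargs)
        self.derived = True


df = DataFrameDerived({"a": [1, 2, 3]})
subset = df.head(2)
print(type(subset).__name__, subset.derived)
```

Methods that return a Series (for example df["a"]) are governed by a separate property, _constructor_sliced, which this sketch leaves at its default.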
Instead of using inheritance, an alternative is to register custom accessor namespaces. This is a simpler way to extend pandas, if that works for you.
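A minimal sketch of the accessor approach, using pd.api.extensions.register_dataframe_accessor. The accessor name "extra", the flag attribute, and the row_count method are all hypothetical names picked for this example.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("extra")
class ExtraAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj
        self.flag = True  # hypothetical extra data

    def row_count(self):
        # Hypothetical helper that uses the wrapped DataFrame.
        return len(self._obj)


df = pd.DataFrame({"a": [1, 2, 3]})
print(df.extra.flag, df.extra.row_count())
```

Note that the accessor is created per DataFrame object, so as far as I know anything you store on it will not follow along to new frames returned by other methods. If you need that, subclassing is probably the better fit.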
Without knowing more details about what exactly you are trying to accomplish, it is difficult to suggest the best way forward. But you should definitely start by reading the whole of the docs I've linked to on Extending pandas.