How can I add new attributes to a pandas.DataFrame derived class?


I want to create a class derived from pandas.DataFrame with a slightly different __init__(). I'll store some additional data in new attributes and finally call DataFrame.__init__().

from pandas import DataFrame

class DataFrameDerived(DataFrame):
    def __init__(self, *args, **kwargs):
        self.derived = True
        super().__init__(*args, **kwargs)

DataFrameDerived({'a':[1,2,3]})

This code gives the following error when creating the new attribute (self.derived = True):

RecursionError: maximum recursion depth exceeded while calling a Python object


Solution

  • It is possible, but the implementation isn't very open to extension; indeed, the official docs suggest considering alternatives first. The implementation of pd.DataFrame is complex: it involves multiple inheritance with various mixins, and it uses attribute hooks such as __getattr__ and __setattr__ to, among other things, provide syntactic sugar so that df.some_column and df.some_column = whatever work in place of the df['some_column'] syntax. In your __init__, self.derived = True runs before DataFrame.__init__() has set up the frame's internal state, so those hooks end up consulting state that does not exist yet. If you look at the stack trace, you can see that something is going on with __setattr__:

    RecursionError                            Traceback (most recent call last)
    Cell In[1], line 8
          5         self.derived = True
          6         super().__init__(*args, **kwargs)
    ----> 8 DataFrameDerived({'a':[1,2,3]})
    
    Cell In[1], line 5, in DataFrameDerived.__init__(self, *args, **kwargs)
          4 def __init__(self, *args, **kwargs):
    ----> 5     self.derived = True
          6     super().__init__(*args, **kwargs)
    
    File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:6014, in NDFrame.__setattr__(self, name, value)
       6012 else:
       6013     try:
    -> 6014         existing = getattr(self, name)
       6015         if isinstance(existing, Index):
       6016             object.__setattr__(self, name, value)
    
    File ~/miniconda3/envs/py311/lib/python3.11/site-packages/pandas/core/generic.py:5986, in NDFrame.__getattr__(self, name)
       5976 """
       5977 After regular attribute access, try looking up the name
       5978 This allows simpler access to columns for interactive use.
       5979 """
       5980 # Note: obj.x will always call obj.__getattribute__('x') prior to
       5981 # calling obj.__getattr__('x').
       5982 if (
       5983     name not in self._internal_names_set
       5984     and name not in self._metadata
       5985     and name not in self._accessors
    -> 5986     and self._info_axis._can_hold_identifiers_and_holds_name(name)
       5987 ):
       5988     return self[name]
       5989 return object.__getattribute__(self, name)
    

    Knowing this, one might blindly use object.__setattr__ instead, to bypass the hook:

    In [1]: from pandas import DataFrame
       ...:
       ...: class DataFrameDerived(DataFrame):
       ...:     def __init__(self, *args, **kwargs):
       ...:         object.__setattr__(self, 'derived', True)
       ...:         super().__init__(*args, **kwargs)
       ...:
       ...: DataFrameDerived({'a':[1,2,3]})
    Out[1]:
       a
    0  1
    1  2
    2  3
    

    But again, without really understanding the implementation, you are just crossing your fingers and hoping it works, which it may. As noted in the linked docs, you will probably also want to override the "constructor" properties (such as _constructor), so that your data frame type returns data frames of its own type from DataFrame methods.
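
    As a rough sketch of what the subclassing route could look like, the example below follows the pattern described in the pandas subclassing docs: the attribute name is declared in _metadata and _constructor is overridden. Setting the attribute after super().__init__() is a choice made for this sketch, and behaviour may differ between pandas versions, so treat it as a starting point rather than a definitive implementation.

```python
import pandas as pd


class DataFrameDerived(pd.DataFrame):
    # Names listed in _metadata are stored as plain instance attributes
    # and are meant to be passed on to the results of pandas operations.
    _metadata = ["derived"]

    @property
    def _constructor(self):
        # Tells pandas which class to build when a method returns a new frame.
        return DataFrameDerived

    def __init__(self, *args, **kwargs):
        # Initialise the DataFrame internals first, then add the attribute.
        super().__init__(*args, **kwargs)
        self.derived = True


df = DataFrameDerived({"a": [1, 2, 3]})
subset = df[df["a"] > 1]  # boolean filtering returns a new frame
print(type(subset).__name__, subset.derived)
```

    Because "derived" is listed in _metadata, the assignment goes straight to a normal instance attribute instead of being treated as a possible column name.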

    Instead of using inheritance, an alternative is to register custom accessor namespaces. This is a simpler way to extend pandas, if that works for you.
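
    A minimal sketch of the accessor approach, using pandas' register_dataframe_accessor decorator. The namespace name "extras", the class ExtrasAccessor, and its members are hypothetical names invented for this illustration; what you would actually put in the accessor depends on your use case.

```python
import pandas as pd


# "extras" is a hypothetical namespace name chosen for this example.
@pd.api.extensions.register_dataframe_accessor("extras")
class ExtrasAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    @property
    def derived(self):
        # Plays the role of the flag from the question.
        return True

    def column_total(self, name):
        # Example helper: sum of the values in one column.
        return self._obj[name].sum()


df = pd.DataFrame({"a": [1, 2, 3]})
print(df.extras.derived, df.extras.column_total("a"))
```

    The frame itself stays a plain pandas DataFrame here, so no __init__ or constructor overrides are needed.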

    Without knowing more details about what exactly you are trying to accomplish, it is difficult to suggest the best way forward. But you should definitely start by reading the whole of the docs I've linked to on Extending pandas.