python pandas dataframe python-class

Pandas DataFrames as class properties: how should I initialize them in the class constructor __init__()?


I have a class that manages multiple pandas DataFrames, which are exposed as class properties. In the class constructor I initialize every one of them to an empty DataFrame, because the data is not available when the instance is created and some DataFrames will be built from data in other DataFrames of this class. It looks like this:

import pandas as pd


class MyClass:
    """
    Handle data cached in csv files
    """

    def __init__(self):
        """
        initialize MyClass class
        """

        self._parent_df = pd.DataFrame()
        self._child_df = pd.DataFrame()
        self.stored_data_df = pd.DataFrame()
        self.personnel_data_df = pd.DataFrame()
        self.salary_df = pd.DataFrame()
        self.settings = {}
        self._last_update = self._get_last_upd()
        self._last_events_df = pd.DataFrame()

    @property
    def parent_df(self):
        if not self._parent_df.empty:
            return self._parent_df
        else:
            raise AttributeError()

    @parent_df.setter
    def parent_df(self, value: pd.DataFrame):
        self._parent_df = value

    # more properties getting and setting DataFrames
    
    # … and some methods working with data in multiple DataFrames

What is the best practice for writing this class? Since initializing and assigning DataFrames is a resource-heavy task, is this approach considered 'Pythonic'? Should I avoid defining them in __init__, or assign None as the initial value instead of an empty DataFrame (for example, self._parent_df = None)? Also, if anybody knows a good open-source package with a class that works like this, I'd be happy to look at it.


Solution

  • I believe this approach is quite expensive, since creating the object also performs all the expensive initialization up front (plenty of DataFrames, etc.).

    The best approach here is lazy initialization, in which the getter of a property is responsible for initializing the property itself if it has not been initialized yet. Sample code:

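    def expensive_initialization():
        # Placeholder for the costly work, e.g. reading a CSV into a DataFrame.
        return "expensive result"


    class MyClass:
        def __init__(self):
            # None marks the value as "not computed yet".
            self._value = None

        @property
        def value(self):
            # Compute on first access only; later accesses reuse the result.
            if self._value is None:
                self._value = expensive_initialization()
            return self._value

    my_instance = MyClass()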
    

    When my_instance.value is accessed for the first time, the getter sees that my_instance._value is still None and calls expensive_initialization() (whatever that is for the given property). Later accesses return the stored value.
    This way, each property is initialized only when it is first needed.
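
Applied to the class in the question, the same pattern might look like the sketch below. It assumes child_df is derived from parent_df; the _build_child_df helper and the "name"/"active" columns are hypothetical, made up purely for illustration.

```python
from typing import Optional

import pandas as pd


class MyClass:
    """Handle data cached in csv files (sketch with lazily built DataFrames)."""

    def __init__(self) -> None:
        # None marks "not loaded yet"; no empty DataFrame is allocated up front.
        self._parent_df: Optional[pd.DataFrame] = None
        self._child_df: Optional[pd.DataFrame] = None

    @property
    def parent_df(self) -> pd.DataFrame:
        if self._parent_df is None:
            raise AttributeError("parent_df has not been set yet")
        return self._parent_df

    @parent_df.setter
    def parent_df(self, value: pd.DataFrame) -> None:
        self._parent_df = value
        # Invalidate the derived frame so it is rebuilt from the new parent.
        self._child_df = None

    @property
    def child_df(self) -> pd.DataFrame:
        # Built on first access only, then reused.
        if self._child_df is None:
            self._child_df = self._build_child_df()
        return self._child_df

    def _build_child_df(self) -> pd.DataFrame:
        # Hypothetical derivation: keep rows whose "active" column is True.
        parent = self.parent_df
        return parent[parent["active"]].reset_index(drop=True)


manager = MyClass()
manager.parent_df = pd.DataFrame(
    {"name": ["a", "b", "c"], "active": [True, False, True]}
)
print(manager.child_df)
```

One consequence of using None rather than an empty DataFrame as the sentinel: a DataFrame that is legitimately empty is no longer mistaken for missing data.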

    As for packages, there is lazy_python. For further explanation, there is a nice article by someone who went deep into this approach: How to Create Lazy Attributes to Improve Performance in Python.
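
If a third-party package is not wanted, the standard library's functools.cached_property (Python 3.8+) gives the same compute-once behaviour without a hand-written None check. A minimal sketch; expensive_initialization and the calls counter are stand-ins for illustration only:

```python
from functools import cached_property


def expensive_initialization() -> list:
    # Placeholder for costly work such as reading CSV files.
    return [n * n for n in range(5)]


class MyClass:
    def __init__(self) -> None:
        self.calls = 0  # only here to show how often the body runs

    @cached_property
    def value(self) -> list:
        # Runs on first access; the result is then stored on the instance.
        self.calls += 1
        return expensive_initialization()


my_instance = MyClass()
my_instance.value
my_instance.value
print(my_instance.calls)  # 1
```

Deleting the attribute (del my_instance.value) clears the cached result, so the next access recomputes it.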

    I hope this helps!