pythonpandas

Logging operation results in pandas (equivalent of STATA/tidylog)


When I do an operation in STATA, for example removing duplicated rows, it will tell me the number of rows removed, for instance:

. sysuse auto.dta
(1978 automobile data)

. drop if mpg<15
(8 observations deleted)

. drop if rep78==.
(4 observations deleted)

For the tidyverse, the package tidylog implements a similar feature, providing feedback on the operation (e.g. for a join, number of joined and unjoined rows, for a filter, nubmer of removed rows, etc.), with the little disadvantage that you will lose the autocompletion of your editor, as it wraps tidyverse functions with definitions like filter(...), to accomodate for the fact that the tidyverse upstream definition could change over time.

Is there something similar for pandas?

I found pandas-log but seems abandoned.

Related question for R.


Solution

  • I ended up with the following, which seem sufficient for my use case:

    import pandas
    import logging
    import pandas_flavor as pf
    
    
    @pf.register_dataframe_method
    def startlog(self):
        obj = SubclassedDataFrame2(self)
        obj.initial_df = self
        return obj
    
    @pf.register_dataframe_method
    def midlog(self):
        return self.endlog().startlog()
    
    @pf.register_dataframe_method
    def endlog(self):
        if(self.shape != self.initial_df.shape):
            nrow0, ncol0 = self.initial_df.shape
            nrow1, ncol1 = self.shape
            dr = nrow0 - nrow1
            dc = ncol0 - ncol1
            msg = ""
            if dr > 0:
                msg += f"Removed {dr:,d}/{nrow0:,d} ({dr/nrow0:.2%}) rows. "
            elif dr < 0:
                msg += f"Added {dr:,d}/{nrow0:,d} ({dr/nrow0:.2%}) rows. "
            if dc > 0:
                msg += f"Removed {dc:,d}/{ncol0:,d} ({dc/ncol0:.2%}) columns."
            elif dc < 0:
                msg += f"Added {dc:,d}/{ncol0:,d} ({dc/ncol0:.2%}) columns."
            logging.getLogger("pandas-utils").info(msg)
        else:
            nchanged = (self != self.initial_df)
            # missings are different than themselves
            nchanged[self.isna() & self.initial_df.isna()] = False
            nchanged = nchanged.sum().sum()
            ntot = self.initial_df.size
            logging.getLogger("pandas-utils").info(f"Changed {nchanged:,d}/{ntot:,d} ({nchanged/ntot:.2%}) values")
        del self.initial_df
        return self
    

    To be used like this:

    patentsdf.startlog().dropna(subset=["person_address", "person_ctry_code"], how="any").endlog()
    

    This works for all operations that modify rows or columns, like dropping NAs or duplicates.

    For merge, just call pd.merge with indicator=True and then value_counts() the resulting _merge column.