When I do an operation in STATA, for example removing duplicated rows, it will tell me the number of rows removed, for instance:
. sysuse auto.dta
(1978 automobile data)
. drop if mpg<15
(8 observations deleted)
. drop if rep78==.
(4 observations deleted)
For the tidyverse, the package tidylog implements a similar feature, providing feedback on the operation (e.g. for a join, number of joined and unjoined rows, for a filter, nubmer of removed rows, etc.), with the little disadvantage that you will lose the autocompletion of your editor, as it wraps tidyverse functions with definitions like filter(...)
, to accomodate for the fact that the tidyverse upstream definition could change over time.
Is there something similar for pandas?
I found pandas-log but seems abandoned.
I ended up with the following, which seem sufficient for my use case:
import pandas
import logging
import pandas_flavor as pf
@pf.register_dataframe_method
def startlog(self):
obj = SubclassedDataFrame2(self)
obj.initial_df = self
return obj
@pf.register_dataframe_method
def midlog(self):
return self.endlog().startlog()
@pf.register_dataframe_method
def endlog(self):
if(self.shape != self.initial_df.shape):
nrow0, ncol0 = self.initial_df.shape
nrow1, ncol1 = self.shape
dr = nrow0 - nrow1
dc = ncol0 - ncol1
msg = ""
if dr > 0:
msg += f"Removed {dr:,d}/{nrow0:,d} ({dr/nrow0:.2%}) rows. "
elif dr < 0:
msg += f"Added {dr:,d}/{nrow0:,d} ({dr/nrow0:.2%}) rows. "
if dc > 0:
msg += f"Removed {dc:,d}/{ncol0:,d} ({dc/ncol0:.2%}) columns."
elif dc < 0:
msg += f"Added {dc:,d}/{ncol0:,d} ({dc/ncol0:.2%}) columns."
logging.getLogger("pandas-utils").info(msg)
else:
nchanged = (self != self.initial_df)
# missings are different than themselves
nchanged[self.isna() & self.initial_df.isna()] = False
nchanged = nchanged.sum().sum()
ntot = self.initial_df.size
logging.getLogger("pandas-utils").info(f"Changed {nchanged:,d}/{ntot:,d} ({nchanged/ntot:.2%}) values")
del self.initial_df
return self
To be used like this:
patentsdf.startlog().dropna(subset=["person_address", "person_ctry_code"], how="any").endlog()
This works for all operations that modify rows or columns, like dropping NAs or duplicates.
For merge, just call pd.merge
with indicator=True
and then value_counts()
the resulting _merge
column.