python, pandas, dataframe, logging, pretty-print

Log a dataframe using logging and pandas


I am using pandas to operate on dataframes and logging to log intermediate results, as well as warnings and errors, into a separate log file. I also need to print a few intermediate dataframes into the same log file.

The best I could come up with is the code below, which is a bit too verbose. Is there a simpler and more pythonic way to log a dataframe slice?

Note: please include an example of usage.

Example:

import io
import logging
import pandas as pd

# Print into log this many lines of several intermediate dataframes,
# set to 20 or so:
MAX_NUM_DF_LOG_LINES = 4

logging.basicConfig(
    datefmt = '%y%m%d %H:%M:%S',
    format = '%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# Or logging.DEBUG, etc:
logger.setLevel(level = logging.INFO)

# Example of a simple log message:
logger.info('Reading input.')

TESTDATA="""
enzyme  regions   N   length
AaaI    all       10  238045
AaaI    all       20  170393
AaaI    captured  10  292735
AaaI    captured  20  229824
AagI    all       10  88337
AagI    all       20  19144
AagI    captured  10  34463
AagI    captured  20  19220
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

# ...some code....

# Example of a log message with a chunk of a dataframe, here, using
# `head` (but this can be another method that slices a dataframe):
logger.debug('less important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.debug(line)

# ...more code....

logger.info('more important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.info(line)

# ...more code....

Prints:

240102 10:58:20 INFO: Reading input.
240102 10:58:20 INFO: more important intermediate results: df:
240102 10:58:20 INFO:   enzyme   regions   N  length
240102 10:58:20 INFO: 0   AaaI       all  10  238045
240102 10:58:20 INFO: 1   AaaI       all  20  170393
240102 10:58:20 INFO: 2   AaaI  captured  10  292735
240102 10:58:20 INFO: 3   AaaI  captured  20  229824

Related:

None of this accomplishes what I am trying to do, but it is getting closer:

240102 12:27:19 INFO: dataframe head -   enzyme   regions   N  length
0   AaaI       all  10  238045
1   AaaI       all  20  170393
2   AaaI  captured  10  292735
...
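
For reference, the output above comes from the kind of call sketched below (not my exact code): passing the dataframe straight into the logging call, so only the first line of the multi-line message gets the timestamp/level prefix.

# Rough sketch: logging the dataframe directly as part of the message.
# Only the first line gets the '%(asctime)s %(levelname)s:' prefix;
# the remaining dataframe lines are written bare, as shown above.
logger.info('dataframe head - %s', df.head(MAX_NUM_DF_LOG_LINES))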

Solution

  • What you are looking for is a custom Formatter. Using one is more Pythonic: it plugs into the Python logging system and gives better flexibility and code readability.

    import logging

    import pandas as pd

    class DataFrameFormatter(logging.Formatter):
        def __init__(self, fmt: str, n_rows: int = 4) -> None:
            self.n_rows = n_rows
            super().__init__(fmt)

        def format(self, record: logging.LogRecord) -> str:
            if isinstance(record.msg, pd.DataFrame):
                s = ''
                # Per-record override via `extra`, falling back to the default,
                # so a one-off n_rows does not stick for later records:
                n_rows = getattr(record, 'n_rows', self.n_rows)
                lines = record.msg.head(n_rows).to_string().splitlines()
                if hasattr(record, 'header'):
                    record.msg = record.header.strip()
                    s += super().format(record) + '\n'
                for line in lines:
                    record.msg = line
                    s += super().format(record) + '\n'
                return s.strip()
            return super().format(record)
            
    formatter = DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s', n_rows=4)
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setFormatter(formatter)
    logger.addHandler(ch)
    
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'bb': [10, 20, 30, 40, 50]})
    
    logger.info(df, extra={'header': "this is a header line"})
    logger.debug('foo')    
    logger.info(df, extra={'n_rows': 2})
    

    This will produce the following log:

    2024-01-09 15:09:53,384 INFO     this is a header line
    2024-01-09 15:09:53,384 INFO        a  bb
    2024-01-09 15:09:53,384 INFO     0  1  10
    2024-01-09 15:09:53,384 INFO     1  2  20
    2024-01-09 15:09:53,384 INFO     2  3  30
    2024-01-09 15:09:53,384 INFO     3  4  40
    2024-01-09 15:09:53,385 DEBUG    foo
    2024-01-09 15:09:53,385 INFO        a  bb
    2024-01-09 15:09:53,385 INFO     0  1  10
    2024-01-09 15:09:53,385 INFO     1  2  20
    

    This way you can easily control the header and n_rows through the extra dictionary. If you do not provide them, the default values are used.
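
    Since the original goal was to write these chunks into a log file, the same formatter can also be attached to a FileHandler, instead of or in addition to the StreamHandler above. A minimal sketch, assuming a hypothetical log file name pipeline.log and reusing the DataFrameFormatter defined above:

    import logging

    import pandas as pd

    file_formatter = DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s', n_rows=4)

    # 'pipeline.log' is a hypothetical file name; use your own log path:
    fh = logging.FileHandler('pipeline.log')
    fh.setFormatter(file_formatter)

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    logger.addHandler(fh)

    df = pd.DataFrame({'a': [1, 2, 3], 'bb': [10, 20, 30]})
    logger.info('Reading input.')
    logger.info(df, extra={'header': 'intermediate results: df:'})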