I am using pandas to operate on dataframes and logging to log intermediate results, as well as warnings and errors, into a separate log file. I need to also print into the same log file a few intermediate dataframes. Specifically, I want to:

- print dataframes into logging messages (to ensure easier debugging and avoid writing many intermediate files, as would be the case with calls to to_csv with a file destination),
- use logging levels, such as DEBUG or INFO, sharing these with the verbosity of other logging messages (including those that are not related to dataframes).

The best I could come up with is the code below, which is a bit too verbose. Is there a simpler and more pythonic way to log a dataframe slice?
Note:
Please include an example of usage.
Example:
import io
import logging
import pandas as pd
# Print into log this many lines of several intermediate dataframes,
# set to 20 or so:
MAX_NUM_DF_LOG_LINES = 4
logging.basicConfig(
    datefmt='%y%m%d %H:%M:%S',
    format='%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
# Or logging.DEBUG, etc:
logger.setLevel(level = logging.INFO)
# Example of a simple log message:
logger.info('Reading input.')
TESTDATA="""
enzyme regions N length
AaaI all 10 238045
AaaI all 20 170393
AaaI captured 10 292735
AaaI captured 20 229824
AagI all 10 88337
AagI all 20 19144
AagI captured 10 34463
AagI captured 20 19220
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')
# ...some code....
# Example of a log message with a chunk of a dataframe, here, using
# `head` (but this can be another method that slices a dataframe):
logger.debug('less important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.debug(line)
# ...more code....
logger.info('more important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.info(line)
# ...more code....
Prints:
240102 10:58:20 INFO: Reading input.
240102 10:58:20 INFO: more important intermediate results: df:
240102 10:58:20 INFO: enzyme regions N length
240102 10:58:20 INFO: 0 AaaI all 10 238045
240102 10:58:20 INFO: 1 AaaI all 20 170393
240102 10:58:20 INFO: 2 AaaI captured 10 292735
240102 10:58:20 INFO: 3 AaaI captured 20 229824
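The two near-identical loops above can be factored into a small helper. This is only a sketch of that refactoring; log_df is a hypothetical name, not part of any library:

```python
import logging

import pandas as pd

MAX_NUM_DF_LOG_LINES = 4

logging.basicConfig(datefmt='%y%m%d %H:%M:%S',
                    format='%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def log_df(log_fn, msg, df, n=MAX_NUM_DF_LOG_LINES):
    """Log msg, then the first n rows of df, one log record per line."""
    log_fn(msg)
    for line in df.head(n).to_string().splitlines():
        log_fn(line)


df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
log_df(logger.info, 'more important intermediate results: df:', df)
log_df(logger.debug, 'less important intermediate results: df:', df)
```

Passing the bound method (logger.info or logger.debug) keeps the choice of level at the call site, so the helper stays level-agnostic.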
Related:
None of this accomplishes what I try to do, but it is getting closer: print from inside a list comprehension and then discard the result: "Do note that the latter only works on py2 due to map being lazy; you can do [logger.info(line) for line in 'line 1\nline 2\nline 3'.splitlines()] on py3. – Kyuuhachi, Jun 22, 2021 at 16:30".
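Applied to a dataframe, that comprehension-for-side-effects pattern (which does work on py3) can be sketched as below; note the list it builds is immediately discarded, which is why many prefer a plain for loop:

```python
import logging

import pandas as pd

logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# The comprehension calls logger.info once per line, purely for the
# side effect; the resulting list of None values is thrown away:
[logger.info(line) for line in df.head(2).to_string().splitlines()]
```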
What you are looking for is a custom Formatter. Using it will be more Pythonic: it provides better flexibility and code readability, as it is part of Python's logging system.
import logging

import pandas as pd


class DataFrameFormatter(logging.Formatter):
    def __init__(self, fmt: str, n_rows: int = 4) -> None:
        self.n_rows = n_rows
        super().__init__(fmt)

    def format(self, record: logging.LogRecord) -> str:
        if isinstance(record.msg, pd.DataFrame):
            s = ''
            # Per-record override via extra={'n_rows': ...}, without
            # clobbering the formatter's default for later records:
            n_rows = getattr(record, 'n_rows', self.n_rows)
            lines = record.msg.head(n_rows).to_string().splitlines()
            if hasattr(record, 'header'):
                record.msg = record.header.strip()
                s += super().format(record) + '\n'
            for line in lines:
                record.msg = line
                s += super().format(record) + '\n'
            return s.strip()
        return super().format(record)
formatter = DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s', n_rows=4)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setFormatter(formatter)
logger.addHandler(ch)
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'bb': [10, 20, 30, 40, 50]})
logger.info(df, extra={'header': "this is a header line"})
logger.debug('foo')
logger.info(df, extra={'n_rows': 2})
This will produce the following log:
2024-01-09 15:09:53,384 INFO this is a header line
2024-01-09 15:09:53,384 INFO a bb
2024-01-09 15:09:53,384 INFO 0 1 10
2024-01-09 15:09:53,384 INFO 1 2 20
2024-01-09 15:09:53,384 INFO 2 3 30
2024-01-09 15:09:53,384 INFO 3 4 40
2024-01-09 15:09:53,385 DEBUG foo
2024-01-09 15:09:53,385 INFO a bb
2024-01-09 15:09:53,385 INFO 0 1 10
2024-01-09 15:09:53,385 INFO 1 2 20
This way you can easily control header and n_rows through the extra argument. If you do not provide them, the default values will be used.
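Since the question logs into a separate file, the same formatter can be attached to a logging.FileHandler instead of a StreamHandler. The sketch below abridges the formatter (no header support) and assumes a file name of app.log:

```python
import logging

import pandas as pd


class DataFrameFormatter(logging.Formatter):
    """Abridged version of the formatter above: expands a DataFrame
    record into one formatted line per row of to_string() output."""

    def __init__(self, fmt, n_rows=4):
        self.n_rows = n_rows
        super().__init__(fmt)

    def format(self, record):
        if isinstance(record.msg, pd.DataFrame):
            n = getattr(record, 'n_rows', self.n_rows)
            out = []
            for line in record.msg.head(n).to_string().splitlines():
                record.msg = line
                out.append(super().format(record))
            return '\n'.join(out)
        return super().format(record)


logger = logging.getLogger('df_file_logger')
logger.setLevel(logging.DEBUG)
fh = logging.FileHandler('app.log')  # 'app.log' is an assumed filename
fh.setFormatter(DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s'))
logger.addHandler(fh)

df = pd.DataFrame({'a': [1, 2, 3], 'bb': [10, 20, 30]})
logger.info(df, extra={'n_rows': 2})
```

Everything else works the same; only the handler changes, which is the point of putting the logic in a Formatter rather than in the logging calls.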