[SOLVED] Why does read_csv give me a timezone warning?

I am reading a CSV file with pandas and get a warning I do not understand:

Lib\site-packages\dateutil\parser\_parser.py:1207: UnknownTimezoneWarning: tzname B identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  warnings.warn("tzname {tzname} identified but not understood.  "

I do nothing special, just pd.read_csv with parse_dates=True. I see no B that looks like a timezone anywhere in my data. What does the warning mean?

A minimal reproducible example is the following:

import io
import pandas as pd
pd.read_csv(io.StringIO('x\n1A2B'), index_col=0, parse_dates=True)

Why does pandas think 1A2B is a datetime?!

To work around this, I tried adding dtype={'x': str} to force the column to be read as a string, but I still get the warning.

Solution

It turns out 1A2B is being interpreted as "1 AM on day 2 of the current month, timezone B". By default (that is, when no date_parser= is passed), read_csv uses dateutil to detect datetime values:

import dateutil.parser
dateutil.parser.parse('1A2B')

Apart from emitting the warning, this returns (as of today, since the year and month are taken from the current date):

datetime.datetime(2023, 1, 2, 1, 0)

And indeed, B is not a timezone name that dateutil understands on its own.
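
To see what the warning is asking for, here is a small sketch that calls dateutil directly with a tzinfos mapping. The +02:00 offset for B and the fixed default date are my own choices for illustration and are not taken from the data:

```python
from datetime import datetime, timedelta

import dateutil.parser

# Fixed default so the result does not depend on today's date.
default = datetime(2000, 6, 15)

# Assumption for illustration only: map the name "B" to a +02:00 offset (in seconds).
tzinfos = {"B": 2 * 3600}

parsed = dateutil.parser.parse("1A2B", default=default, tzinfos=tzinfos)

# Year and month come from `default`; the hour ("1A") and the day ("2")
# come from the string, and the timezone from the tzinfos mapping.
print(parsed.isoformat())
```

With a tzinfos entry for B, dateutil returns a timezone-aware datetime and the warning should no longer appear, which confirms that the string really is being treated as a time of day, a day of month, and a timezone name.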

Why adding dtype doesn't help remains to be investigated.

I did find a simple hack that works:

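import dateutil.parser
# Replacement for parser.parse: return the raw result of the internal _parse()
# and skip the step that fills missing fields from a default date.
def dateparse(self, timestr, default=None, ignoretz=False, tzinfos=None, **kwargs):
    return self._parse(timestr, **kwargs)
dateutil.parser.parser.parse = dateparse  # Monkey patch; hack!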

This stops dateutil from filling in the current day/month/year as defaults, so the value is no longer accepted as a datetime, as expected.
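
If the column is not supposed to hold dates at all, a less invasive option may be to simply leave parse_dates off, so that no date detection runs on the index. A minimal sketch, reusing the example data from the question (the variable names are mine):

```python
import io

import pandas as pd

csv_text = "x\n1A2B"

# Without parse_dates, the index values are kept as plain strings
# and are never handed to a date parser.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.index.tolist())
```

This avoids patching dateutil globally, at the cost of having to convert any real date columns explicitly afterwards (for example with pd.to_datetime and an explicit format).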