pythonpandascsvdatetimestamp

Reading CSV dates with pandas returns datetime instead of Timestamp


Problem description:

I'm trying to read historical stock prices from CSV format into pandas Dataframe, but one funny thing I have noticed so far - when reading certain row numbers, the dates column type is changed from pandas.Timestamp into datetime - how does that work? How can I read pandas.Timestamp then?

Minimum reproduction example:

I have inspected my CSV file, and here is a minimal needed data example from it.

import pandas as pd
file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

Result: <class 'datetime.datetime'>

VERSUS

import pandas as pd
file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

file = open('temp.csv', 'w')
file.write(
    """Local time,Open,High,Low,Close,Volume
02.03.2014 02:00:00.000 GMT+0200,1.37620,1.37882,1.37586,1.37745,5616.04
03.03.2014 02:00:00.000 GMT+0200,1.37745,1.37928,1.37264,1.37357,136554.6563
04.03.2014 02:00:00.000 GMT+0200,1.37356,1.37820,1.37211,1.37421,124863.8203""")
file.close()

data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))

Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Versions:

pandas==1.2.3 pandas-datareader==0.9.0

Summary:

I need to read pandas.Timestamp because of some latter data manipulations, not datetime, and have no ideas what's wrong here - hope you guys can help...

I have created a GitHub issue as well, but it's not triaged yet.


Solution

  • You can specify which date_parser function to be used:

    data = pd.read_csv('temp.csv', 
                       parse_dates = ["Local time"],
                       date_parser=pd.Timestamp)
    

    Output:

    >>> data
                      Local time     Open     High      Low    Close       Volume
    0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
    1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
    2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203
    
    >>> type(data['Local time'][0])
    <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    

    By my observation pandas automatically parses each entry as datetime when timezone are different for individual observation.

    The above should work if you really need to use pd.Timestamp.

    Running the above however also gives me a FutureWarning, which I researched and found to be harmless as of now.

    EDIT

    After a bit more research:

    pandas tries to convert a date type column to DatetimeIndex for more efficiency in datetime based operations. But for this pandas needs to have a common timezone for the entire column.

    On explicitly trying to convert to pd.DatetimeIndex

    >>> data
                      Local time     Open     High      Low    Close       Volume
    0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
    1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
    2  2014-04-03 02:00:00-04:00  1.37356  1.37820  1.37211  1.37421  124863.8203
    
    >>> pd.DatetimeIndex(data['Local time'])
    
    ValueError: Array must be all same time zone
    
    During handling of the above exception, another exception occurred:
    
    ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
    

    So when converting to DatetimeIndex fails pandas then keeps the data as strings (dtype : object) internally and individual entries to be processed as datetime.

    Documentation recommends that if timezones in the data are different specify UTC=True, so the timezone would be set as UTC and time values would be changed accordingly.

    From Documentation:

    pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

    To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True

    In a data that already has the same timezone DatetimeIndex works seamlessly:

    >>> data
                     Local time     Open     High      Low    Close       Volume
    0 2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
    1 2014-03-03 02:00:00-02:00  1.37745  1.37928  1.37264  1.37357  136554.6563
    2 2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203
    
    
    >>> pd.DatetimeIndex(data['Local time'])
    
    DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
                   '2014-04-03 02:00:00-02:00'],
                  dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)
    
    >>> type(pd.DatetimeIndex(data['Local time'])[0])
    
    <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    

    References: