pythonpandasdatetimetimezonedatetimeindex

Creating pandas DatetimeIndex in Dataframe from DST aware datetime objects


From an online API I gather a series of data points, each with a value and an ISO timestamp. Unfortunately I need to loop over them, so I store them in a temporary dict and then create a pandas dataframe from that and set the index to the timestamp column (simplified example):

from datetime import datetime
import pandas


input_data = [
    '2019-09-16T06:44:01+02:00',
    '2019-11-11T09:13:01+01:00',
]

data = []
for timestamp in input_data:
    _date = datetime.fromisoformat(timestamp)

    data.append({'time': _date})

pd_data = pandas.DataFrame(data).set_index('time')

As long as all timestamps are in the same timezone and DST/non-DST everything works fine, and, I get a Dataframe with a DatetimeIndex which I can work on later. However, once two different time-offsets appear in one dataset (above example), I only get an Index, in my dataframe, which does not support any time-based methods.

Is there any way to make pandas accept timezone-aware, differing date as index?


Solution

  • import pandas as pd
    
    # sample data
    input_data = ['2019-09-16T06:44:01+02:00', '2019-11-11T09:13:01+01:00']
    
    # dataframe
    df = pd.DataFrame(input_data, columns=['datetime'])
    
    # separate the offset from the datetime and convert it to a timedelta
    df['offset'] = pd.to_timedelta(df.datetime.str[-6:] + ':00')
    
    # if desired, create a str with the separated datetime
    # converting this to a datetime will lead to AmbiguousTimeError because of overlapping datetimes at 2AM, per the OP
    df['datetime_str'] = df.datetime.str[:-6]
    
    # convert the datetime column to a datetime format without the offset
    df['datetime_utc'] = pd.to_datetime(df.datetime, utc=True)
    
    # display(df)
                        datetime          offset        datetime_str              datetime_utc
    0  2019-09-16T06:44:01+02:00 0 days 02:00:00 2019-09-16 06:44:01 2019-09-16 04:44:01+00:00
    1  2019-11-11T09:13:01+01:00 0 days 01:00:00 2019-11-11 09:13:01 2019-11-11 08:13:01+00:00
    
    print(df.info())
    [out]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2 entries, 0 to 1
    Data columns (total 4 columns):
     #   Column        Non-Null Count  Dtype              
    ---  ------        --------------  -----              
     0   datetime      2 non-null      object             
     1   offset        2 non-null      timedelta64[ns]    
     2   datetime_str  2 non-null      object             
     3   datetime_utc  2 non-null      datetime64[ns, UTC]
    dtypes: datetime64[ns, UTC](1), object(2), timedelta64[ns](1)
    memory usage: 192.0+ bytes
    
    # convert to local timezone
    df.datetime_utc.dt.tz_convert('US/Pacific')
    
    [out]:
    0   2019-09-15 21:44:01-07:00
    1   2019-11-11 00:13:01-08:00
    Name: datetime_utc, dtype: datetime64[ns, US/Pacific]
    

    Other Resources