python pandas date datetime pendulum

Making Pandas work with Pendulum


I recently stumbled upon pendulum, an awesome new library that makes working with datetimes easier.

In pandas, there is the handy to_datetime() function, which converts Series and other objects to datetimes:

raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')

What would be the canonical way to create a custom to_<something> method - in this case, a to_pendulum() method that converts a Series of date strings directly to Pendulum objects?

This could give Series various interesting capabilities, for instance converting a series of date strings to a series of "offsets from now" (human-readable datetime diffs).


Solution

  • What would be the canonical way to create a custom to_<something> method - in this case, a to_pendulum() method that converts a Series of date strings directly to Pendulum objects?

    After looking through the Pendulum API a bit, I must say I'm impressed with what they've done. Unfortunately, I don't think Pendulum and pandas can work together (at least not with the latest version at the time of writing, v0.21).

    The most important reason is that pandas does not natively support Pendulum as a datatype. The natively supported datatypes (NumPy integers, floats and np.datetime64) all support vectorisation in some form. Pendulum objects would have to be stored with dtype=object, so you are not going to get a shred of performance improvement from using a dataframe over, say, a vanilla loop and a list. If anything, calling apply on a Series of Pendulum objects is going to be slower (because of all the API overhead).
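
    To make the difference concrete, here is a minimal sketch using pandas only. The sample strings and the reference time are made up for illustration and are not from the original data. It contrasts a native datetime64 column, where arithmetic is vectorised, with the same values stored as generic Python objects.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original post.
raw = pd.Series(["2017-11-09 18:43:45", "2017-11-10 00:09:40"])

# Native route: the strings become a datetime64 column.
native = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

# Vectorised arithmetic against a fixed reference time (chosen arbitrarily).
# The result is a timedelta64 column, computed without a Python-level loop.
reference = pd.Timestamp("2017-11-11 00:00:00")
offsets = reference - native

# Object route: the same values held as individual Python objects.
# Operations on such a column fall back to per-element Python calls.
as_objects = native.astype(object)

print(native.dtype, offsets.dtype, as_objects.dtype)
```

    The offsets column is probably the closest native equivalent to "offsets from now": it stays in a vectorised dtype, although it does not give human-readable strings.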

    Another reason is that Pendulum is a subclass of datetime:

    from datetime import datetime
    import pendulum

    # A Pendulum instance passes the isinstance check against the stdlib datetime
    isinstance(pendulum.now(), datetime)
    True
    

    This is important because, as mentioned above, datetime is a supported datatype, so pandas will attempt to coerce any datetime (including Pendulum objects) to its native datetime format, Timestamp. Here's an example:

    print(s)
    
    0     2017-11-09 18:43:45
    1     2017-11-09 20:15:27
    2     2017-11-09 22:29:00
    3     2017-11-09 23:42:34
    4     2017-11-10 00:09:40
    5     2017-11-10 00:23:14
    6     2017-11-10 03:32:17
    7     2017-11-10 10:59:24
    8     2017-11-10 11:12:59
    9     2017-11-10 13:49:09
    
    s = s.apply(pendulum.parse)
    s
    
    0    2017-11-09 18:43:45+00:00
    1    2017-11-09 20:15:27+00:00
    2    2017-11-09 22:29:00+00:00
    3    2017-11-09 23:42:34+00:00
    4    2017-11-10 00:09:40+00:00
    5    2017-11-10 00:23:14+00:00
    6    2017-11-10 03:32:17+00:00
    7    2017-11-10 10:59:24+00:00
    8    2017-11-10 11:12:59+00:00
    9    2017-11-10 13:49:09+00:00
    Name: timestamp, dtype: datetime64[ns, <TimezoneInfo [UTC, GMT, +00:00:00, STD]>]
    
    s[0]
    Timestamp('2017-11-09 18:43:45+0000', tz='<TimezoneInfo [UTC, GMT, +00:00:00, STD]>')
    
    type(s[0])
    pandas._libs.tslib.Timestamp
    

    So, with some difficulty (involving dtype=object), you could load Pendulum objects into dataframes. Here's how you'd do that:

    # Parse element-wise, then force dtype=object so pandas keeps the Pendulum objects
    v = np.vectorize(pendulum.parse)
    s = pd.Series(v(s), dtype=object)
    
    s
    
    0     2017-11-09T18:43:45+00:00
    1     2017-11-09T20:15:27+00:00
    2     2017-11-09T22:29:00+00:00
    3     2017-11-09T23:42:34+00:00
    4     2017-11-10T00:09:40+00:00
    5     2017-11-10T00:23:14+00:00
    6     2017-11-10T03:32:17+00:00
    7     2017-11-10T10:59:24+00:00
    8     2017-11-10T11:12:59+00:00
    9     2017-11-10T13:49:09+00:00
    
    s[0]
    <Pendulum [2017-11-09T18:43:45+00:00]>
    

    However, this is essentially useless: calling any Pendulum method (via apply) will not only be super slow, but the result will also end up being coerced to Timestamp again. It is an exercise in futility.
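
    If you still want a to_pendulum-style helper despite these caveats, the dtype=object trick can be wrapped in a plain function. The following is only a sketch: to_objects and FancyDatetime are hypothetical names, and FancyDatetime is a stand-in datetime subclass used so that the example runs without Pendulum installed.

```python
from datetime import datetime

import pandas as pd


class FancyDatetime(datetime):
    """Hypothetical stand-in for a datetime subclass such as Pendulum."""


def to_objects(series, parser):
    """Parse every element with `parser` and keep the results as Python objects."""
    parsed = [parser(value) for value in series]
    # dtype=object stops pandas from converting datetime subclasses to Timestamp.
    return pd.Series(parsed, index=series.index, dtype=object, name=series.name)


# Hypothetical sample data in ISO 8601 format.
raw = pd.Series(["2017-11-09T18:43:45", "2017-11-10T00:09:40"], name="timestamp")

# With dtype=object the custom type is preserved.
kept = to_objects(raw, FancyDatetime.fromisoformat)

# Without it, pandas infers a datetime64 dtype and the subclass is lost.
coerced = pd.Series(list(kept))

print(type(kept.iloc[0]).__name__, kept.dtype)
print(type(coerced.iloc[0]).__name__, coerced.dtype)
```

    Passing pendulum.parse as the parser argument should give a Series of Pendulum objects in the same way, with the same performance caveats as above.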