I've recently stumbled upon an awesome new library, pendulum, that makes working with datetimes easier.
In pandas, there is the handy to_datetime() function, which converts Series and other objects to datetimes:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
What would be the canonical way to create a custom to_<something> method, in this case a to_pendulum() method that converts a Series of date strings directly to Pendulum objects?
This could give a Series various interesting capabilities, for instance converting a series of date strings to a series of "offsets from now" (human-readable datetime diffs).
After looking through the API a bit, I must say I'm impressed with what they've done. Unfortunately, I don't think Pendulum and pandas can work together (at least, with the current latest version - v0.21).
The most important reason is that pandas does not natively support Pendulum as a datatype. The natively supported datatypes (np.int64, np.float64 and np.datetime64) all support vectorisation in some form. Pendulum objects, by contrast, would have to be stored as generic Python objects (dtype=object), so you are not going to get a shred of performance improvement using a dataframe over, say, a vanilla loop and list. If anything, calling apply on a Series of Pendulum objects is going to be slower (because of all the API overheads).
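To make the vectorisation point concrete, here is a minimal sketch that uses plain datetime objects as a stand-in for Pendulum (an assumption, so that the snippet runs without Pendulum installed). The variable names are illustrative only. The native datetime64 Series exposes the vectorised .dt accessor, while the object-dtype Series has to go through a Python-level apply:

```python
import pandas as pd

# Native datetime64 Series: pandas stores the values in a typed array.
stamps = pd.Series(pd.to_datetime(["2017-11-09 18:43:45", "2017-11-10 00:09:40"]))

# Same values held as generic Python objects (the situation Pendulum would be in).
as_objects = pd.Series([ts.to_pydatetime() for ts in stamps], dtype=object)

# Vectorised: the hour is extracted for the whole column at once.
native_hours = stamps.dt.hour

# Not vectorised: a Python function is called once per element.
object_hours = as_objects.apply(lambda d: d.hour)

print(stamps.dtype, as_objects.dtype)
print(native_hours.tolist(), object_hours.tolist())
```

Both routes give the same hours, but only the first one avoids a per-element Python call.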
Another reason is that Pendulum is a subclass of datetime:
import pendulum
from datetime import datetime
isinstance(pendulum.now(), datetime)
True
This is important because, as mentioned above, datetime is a supported datatype, so pandas will attempt to coerce datetime objects to pandas' native datetime format, Timestamp. Here's an example.
print(s)
0 2017-11-09 18:43:45
1 2017-11-09 20:15:27
2 2017-11-09 22:29:00
3 2017-11-09 23:42:34
4 2017-11-10 00:09:40
5 2017-11-10 00:23:14
6 2017-11-10 03:32:17
7 2017-11-10 10:59:24
8 2017-11-10 11:12:59
9 2017-11-10 13:49:09
s = s.apply(pendulum.parse)
s
0 2017-11-09 18:43:45+00:00
1 2017-11-09 20:15:27+00:00
2 2017-11-09 22:29:00+00:00
3 2017-11-09 23:42:34+00:00
4 2017-11-10 00:09:40+00:00
5 2017-11-10 00:23:14+00:00
6 2017-11-10 03:32:17+00:00
7 2017-11-10 10:59:24+00:00
8 2017-11-10 11:12:59+00:00
9 2017-11-10 13:49:09+00:00
Name: timestamp, dtype: datetime64[ns, <TimezoneInfo [UTC, GMT, +00:00:00, STD]>]
s[0]
Timestamp('2017-11-09 18:43:45+0000', tz='<TimezoneInfo [UTC, GMT, +00:00:00, STD]>')
type(s[0])
pandas._libs.tslib.Timestamp
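The same coercion can be reproduced without Pendulum at all. The sketch below uses a hypothetical StandInDateTime class, which is just an empty subclass of datetime and not part of any library, to suggest that it is the subclass relationship that triggers the conversion:

```python
from datetime import datetime

import pandas as pd


class StandInDateTime(datetime):
    """Hypothetical datetime subclass standing in for Pendulum."""


values = [
    StandInDateTime(2017, 11, 9, 18, 43, 45),
    StandInDateTime(2017, 11, 9, 20, 15, 27),
]

# Without an explicit dtype, pandas infers a datetime64 column
# and the subclass identity is lost.
coerced = pd.Series(values)

print(coerced.dtype)
print(type(coerced.iloc[0]).__name__)
```

The elements come back as Timestamp rather than StandInDateTime, which mirrors what happens to the Pendulum objects above.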
So, with some difficulty (involving dtype=object), you could load Pendulum objects into dataframes. Here's how you'd do that:
v = np.vectorize(pendulum.parse)
s = pd.Series(v(s), dtype=object)
s
0 2017-11-09T18:43:45+00:00
1 2017-11-09T20:15:27+00:00
2 2017-11-09T22:29:00+00:00
3 2017-11-09T23:42:34+00:00
4 2017-11-10T00:09:40+00:00
5 2017-11-10T00:23:14+00:00
6 2017-11-10T03:32:17+00:00
7 2017-11-10T10:59:24+00:00
8 2017-11-10T11:12:59+00:00
9 2017-11-10T13:49:09+00:00
s[0]
<Pendulum [2017-11-09T18:43:45+00:00]>
However, this is essentially useless, because calling any pendulum method (via apply) will now not only be super slow, but any datetime it returns will also end up being coerced to Timestamp again. It is an exercise in futility.
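If a to_<something>-style method is still wanted despite these caveats, one possible route is a custom Series accessor. This is only a sketch, and it assumes a pandas version newer than the one discussed here (pd.api.extensions.register_series_accessor was added in pandas 0.23). The accessor name "parsed" and the method to_objects are hypothetical, and the parser is passed in as an argument so that the snippet runs without Pendulum installed:

```python
from datetime import datetime

import pandas as pd


@pd.api.extensions.register_series_accessor("parsed")  # hypothetical accessor name
class ParsedAccessor:
    def __init__(self, series):
        self._series = series

    def to_objects(self, parser):
        """Apply `parser` to each element and keep the results as Python objects."""
        values = [parser(v) for v in self._series]
        # dtype=object stops pandas from coercing datetime subclasses to Timestamp.
        return pd.Series(
            values,
            index=self._series.index,
            name=self._series.name,
            dtype=object,
        )


raw = pd.Series(["2017-11-09 18:43:45", "2017-11-09 20:15:27"], name="timestamp")

# datetime.fromisoformat is a stand-in here; pendulum.parse could be passed instead.
parsed = raw.parsed.to_objects(datetime.fromisoformat)

print(parsed.dtype)
print(type(parsed.iloc[0]).__name__)
```

This only changes how the conversion is spelled. The result is still an object-dtype Series, so the performance and coercion concerns described above apply unchanged.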