I wanted to add/subtract the UTC
offset (usually in hours) to/from the datetime object in polars but I don't seem to see a way to do this. the UTC offset can be dynamic given there's Day Light Saving period comes into play in a calendar year. (e.g., EST/EDT maps to 5/4 hours of UTC
offset respectively).
from datetime import datetime
import pytz
import polars as pl
from datetime import date
# Make a datetime-only dataframe that covers DST period of year, in UTC time first.
df = pl.DataFrame(
pl.date_range(low=date(2022,1,3),
high=date(2022,9,30),
interval="5m",
time_unit="ns",
time_zone="UTC")
.alias("timestamp")
)
# Convert timezone to "America/New_York", which covers both EST and EDT.
us_df = df.with_column(
pl.col("timestamp")
.dt
.cast_time_zone(tz="America/New_York")
.alias("datetime")
)
# Check us_df output
us_df
# output, here `polars` is showing US time without the UTC offset
# Before 0.14.22 `polars` is showing time with UTC offset
# i.e., `23:45:00 UTC` should be `19:45:00 EDT`
# Now `polars` is showing `15:45:00 EDT`, without 4 hours of offset
┌─────────────────────────┬────────────────────────────────┐
│ timestamp ┆ datetime │
│ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 14:00:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 14:05:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 14:10:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 14:15:00 EST │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 15:45:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 15:50:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 15:55:00 EDT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 16:00:00 EDT │
└─────────────────────────┴────────────────────────────────┘
Converting to_pandas
, we should observe that the underlying datetime
object does not include that 4 hours of offset in the actual time as well (remember EST is also in this dataframe, and it has a 5-hour offset).
# Convert to pandas
us_pd = us_df.to_pandas()
us_pd
# output
timestamp datetime
0 2022-01-03 00:00:00+00:00 2022-01-02 14:00:00-05:00
1 2022-01-03 00:05:00+00:00 2022-01-02 14:05:00-05:00
2 2022-01-03 00:10:00+00:00 2022-01-02 14:10:00-05:00
3 2022-01-03 00:15:00+00:00 2022-01-02 14:15:00-05:00
4 2022-01-03 00:20:00+00:00 2022-01-02 14:20:00-05:00
... ... ...
77756 2022-09-29 23:40:00+00:00 2022-09-29 15:40:00-04:00
77757 2022-09-29 23:45:00+00:00 2022-09-29 15:45:00-04:00
77758 2022-09-29 23:50:00+00:00 2022-09-29 15:50:00-04:00
77759 2022-09-29 23:55:00+00:00 2022-09-29 15:55:00-04:00
77760 2022-09-30 00:00:00+00:00 2022-09-29 16:00:00-04:00
What I wanted was to include the UTC
offset into the actual time, such that I can do filtering on the time (in a natural way). For instance, if I am seeing 2300UTC is 1900EDT, I can filter using 1900 directly (please note I can't just add/substract the UTC
offset on the fly during filtering, as the number of hours is dynamic given DST).
The underlying python datetime
does have utcoffset
function, which can be applied on each datetime object, but I'd need to convert polars
to pandas
first (I don't see how to do this within polars
).
I've also observed this peculiar difference:
us_pd.datetime[us_pd.shape[0]-1].to_pydatetime()
# We can see it is identical to what's already in `polars` and `pandas` dataframe.
datetime.datetime(2022, 9, 29, 16, 0, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)
# Now we create a single datetime object with arbitrary UTC time and convert it to New York time
datetime(2022, 9, 30, 22, 45, 0,0, pytz.utc).astimezone(pytz.timezone("America/New_York"))
# The representation here is actually the correct New York time (as in, the offset has been included)
datetime.datetime(2022, 9, 30, 18, 45, tzinfo=<DstTzInfo 'America/New_York' EDT-1 day, 20:00:00 DST>)
It seems you're looking for convert_time_zone
. Ex:
from datetime import date
import polars as pl
df = pl.DataFrame(
pl.datetime_range(
start=date(2022, 1, 3),
end=date(2022, 9, 30),
interval="5m",
time_unit="ns",
time_zone="UTC",
eager=True
).alias("timestamp")
)
us_df = df.with_columns(
pl.col("timestamp").dt.convert_time_zone(time_zone="America/New_York").alias("datetime")
)
shape: (77_761, 2)
┌─────────────────────────┬────────────────────────────────┐
│ timestamp ┆ datetime │
│ --- ┆ --- │
│ datetime[ns, UTC] ┆ datetime[ns, America/New_York] │
╞═════════════════════════╪════════════════════════════════╡
│ 2022-01-03 00:00:00 UTC ┆ 2022-01-02 19:00:00 EST │
│ 2022-01-03 00:05:00 UTC ┆ 2022-01-02 19:05:00 EST │
│ 2022-01-03 00:10:00 UTC ┆ 2022-01-02 19:10:00 EST │
│ 2022-01-03 00:15:00 UTC ┆ 2022-01-02 19:15:00 EST │
│ 2022-01-03 00:20:00 UTC ┆ 2022-01-02 19:20:00 EST │
│ … ┆ … │
│ 2022-09-29 23:40:00 UTC ┆ 2022-09-29 19:40:00 EDT │
│ 2022-09-29 23:45:00 UTC ┆ 2022-09-29 19:45:00 EDT │
│ 2022-09-29 23:50:00 UTC ┆ 2022-09-29 19:50:00 EDT │
│ 2022-09-29 23:55:00 UTC ┆ 2022-09-29 19:55:00 EDT │
│ 2022-09-30 00:00:00 UTC ┆ 2022-09-29 20:00:00 EDT │
└─────────────────────────┴────────────────────────────────┘