Here's our test data to work with:
import polars as pl
from datetime import date

# 5-minute UTC timestamps from 2022-01-03 through 2022-09-30
df = pl.DataFrame(
    pl.datetime_range(
        start=date(2022, 1, 3),
        end=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
        eager=True,
    ).alias("UTC")
)
I specifically need replace_time_zone to actually change the underlying timestamp.
It works with convert_time_zone:
df.select(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)
shape: (77_761, 1)
┌────────────────────────────────┐
│ US │
│ --- │
│ datetime[ns, America/New_York] │
╞════════════════════════════════╡
│ 2022-01-02 19:00:00 EST │
│ 2022-01-02 19:05:00 EST │
│ 2022-01-02 19:10:00 EST │
│ 2022-01-02 19:15:00 EST │
│ 2022-01-02 19:20:00 EST │
│ … │
│ 2022-09-29 19:40:00 EDT │
│ 2022-09-29 19:45:00 EDT │
│ 2022-09-29 19:50:00 EDT │
│ 2022-09-29 19:55:00 EDT │
│ 2022-09-29 20:00:00 EDT │
└────────────────────────────────┘
But it fails with replace_time_zone:
df.select(
    pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
)
# ComputeError: datetime '2022-03-13 02:00:00' is non-existent in time zone 'America/New_York'.
# You may be able to use `non_existent='null'` to return `null` in this case.
You cannot replace the time zone of a UTC time series with a time zone that has DST transitions - you'll end up with wall-clock times that do not exist (spring forward) and/or are ambiguous (fall back) in the new zone. The error could be a bit more informative, but I do not think this is specific to polars.
Here's an illustration. "America/New_York" had a DST transition on March 13, 2022: clocks jumped from 02:00 straight to 03:00, so 2 am did not exist on that day. A range that ends at midnight on March 13 never reaches the gap, so this works fine:
import polars as pl
from datetime import date

df = pl.DataFrame(
    pl.datetime_range(
        start=date(2022, 3, 11),
        end=date(2022, 3, 13),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
        eager=True,
    ).alias("UTC")
)
print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# shape: (289, 1)
# ┌────────────────────────────────┐
# │ US │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-03-11 00:00:00 EST │
# │ 2022-03-11 00:05:00 EST │
# │ 2022-03-11 00:10:00 EST │
# │ 2022-03-11 00:15:00 EST │
# │ … │
while this range, which includes 02:00 on March 13, doesn't:
df = pl.DataFrame(
    pl.datetime_range(
        start=date(2022, 3, 13),
        end=date(2022, 3, 15),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
        eager=True,
    ).alias("UTC")
)
print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# ComputeError: datetime '2022-03-13 02:00:00' is non-existent in time zone
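The same gap can be seen without polars, which supports the point that this is a property of the time zone rather than of the library. Below is a minimal sketch using only the standard library (Python 3.9+ with time zone data available); the helper name `exists_in_zone` is made up for this illustration. It attaches the zone to a naive wall-clock time, round-trips through UTC, and checks whether the same wall-clock time comes back:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def exists_in_zone(naive: datetime, tz: ZoneInfo) -> bool:
    """Return True if the naive wall-clock time actually occurs in tz.

    Hypothetical helper for illustration: a wall-clock time inside a DST
    gap does not survive a round trip through UTC unchanged.
    """
    local = naive.replace(tzinfo=tz)
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


# Wall-clock times around the 2022-03-13 spring-forward transition
for wall in (
    datetime(2022, 3, 13, 1, 55),
    datetime(2022, 3, 13, 2, 0),
    datetime(2022, 3, 13, 2, 55),
    datetime(2022, 3, 13, 3, 0),
):
    print(wall, exists_in_zone(wall, NY))
```

Only the 02:00 and 02:55 entries should be reported as non-existent here, which matches the timestamp polars complains about.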
A workaround you could use is to convert UTC to the desired time zone, then shift the result by that zone's UTC offset. For example:
df = pl.DataFrame(
    pl.datetime_range(
        start=date(2022, 1, 3),
        end=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
        eager=True,
    ).alias("UTC")
)
# Same instants, shown as New York wall-clock time
df = df.with_columns(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)
# UTC minus the New York wall-clock time (relabelled as UTC) gives the
# size of the UTC offset as a duration; add it back onto the converted column
df = df.with_columns(
    (
        pl.col("US")
        + (pl.col("UTC") - pl.col("US").dt.replace_time_zone(time_zone="UTC"))
    ).alias("US_fakeUTC")
)
print(df.select(pl.col("US_fakeUTC")))
# shape: (77_761, 1)
# ┌────────────────────────────────┐
# │ US_fakeUTC │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-01-03 00:00:00 EST │
# │ 2022-01-03 00:05:00 EST │
# │ 2022-01-03 00:10:00 EST │
# │ 2022-01-03 00:15:00 EST │
# │ … │