I have a dataset in the following form:
2024-10-27T00:00 1 A
2024-10-27T00:15 2 B
2024-10-27T00:30 3 C
2024-10-27T00:45 4 D
2024-10-27T01:00 5 E
2024-10-27T01:15 6 F
2024-10-27T01:30 7 G
2024-10-27T01:45 8 H
2024-10-27T02:00 9 I
2024-10-27T02:00 10 J
2024-10-27T02:15 11 K
2024-10-27T02:15 12 L
2024-10-27T02:30 13 M
2024-10-27T02:30 14 N
2024-10-27T02:45 15 O
2024-10-27T02:45 16 P
2024-10-27T03:00 17 Q
2024-10-27T03:15 18 R
2024-10-27T03:30 19 S
2024-10-27T03:45 20 T
These timestamps are in local CE(S)T time, and I want pandas to infer the DST change. However, it throws an error saying that there are 4 DST switches where it expects only one.
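A minimal reproduction of the failing call might look like the sketch below. It assumes the rows are localized in the interleaved order listed above; the names `stamps` and `error_message` are only illustrative, and the exception is caught broadly because the exact exception class may differ between pandas versions.

```python
import pandas as pd

# The repeated hour, in the same interleaved order as the listing above.
stamps = pd.to_datetime([
    "2024-10-27T01:45",
    "2024-10-27T02:00", "2024-10-27T02:00",
    "2024-10-27T02:15", "2024-10-27T02:15",
    "2024-10-27T02:30", "2024-10-27T02:30",
    "2024-10-27T02:45", "2024-10-27T02:45",
    "2024-10-27T03:00",
])

try:
    stamps.tz_localize("Europe/Amsterdam", ambiguous="infer")
    error_message = None
except Exception as exc:  # broad on purpose: the class may vary by pandas version
    error_message = str(exc)

print(error_message)
```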
Interestingly, the code below does work:
import pandas as pd
data = [
["2024-10-27T00:00", 1, "A"],
["2024-10-27T00:15", 2, "B"],
["2024-10-27T00:30", 3, "C"],
["2024-10-27T00:45", 4, "D"],
["2024-10-27T01:00", 5, "E"],
["2024-10-27T01:15", 6, "F"],
["2024-10-27T01:30", 7, "G"],
["2024-10-27T01:45", 8, "H"],
["2024-10-27T02:00", 9, "I"],
["2024-10-27T02:15", 10, "J"],
["2024-10-27T02:30", 11, "K"],
["2024-10-27T02:45", 12, "L"],
["2024-10-27T02:00", 13, "M"],
["2024-10-27T02:15", 14, "N"],
["2024-10-27T02:30", 15, "O"],
["2024-10-27T02:45", 16, "P"],
["2024-10-27T03:00", 17, "Q"],
["2024-10-27T03:15", 18, "R"],
["2024-10-27T03:30", 19, "S"],
["2024-10-27T03:45", 20, "T"],
]
df = pd.DataFrame(data, columns=["timestamp", "number", "letter"])
df.set_index("timestamp", inplace=True)
df.index = pd.to_datetime(df.index)
df.index = df.index.tz_localize("Europe/Amsterdam", ambiguous="infer")
In the latter example, pandas seems to interpret (rightly so) that the first occurrence is summer time and the second is winter time. Is there an easy way to deal with the first ordering? Hardcoding the reorder would be one way, but it seems cumbersome (honestly, I wouldn't even really know how; something like filtering for duplicates and then looping over and swapping the rows?).
Anybody have thoughts on this?
You could sort by the hour and the cumcount of the timestamp groups to programmatically get the order that's accepted by tz_localize. Of course, this would need to be slightly adapted if you have multiple days or even years of data, but the idea remains the same.
If desired, you could finally sort by the original number column (or whatever dedicated serial index you have) to restore the original order.
import pandas as pd
data = [
["2024-10-27T00:00", 1, "A"],
["2024-10-27T00:15", 2, "B"],
["2024-10-27T00:30", 3, "C"],
["2024-10-27T00:45", 4, "D"],
["2024-10-27T01:00", 5, "E"],
["2024-10-27T01:15", 6, "F"],
["2024-10-27T01:30", 7, "G"],
["2024-10-27T01:45", 8, "H"],
["2024-10-27T02:00", 9, "I"],
["2024-10-27T02:00", 13, "M"],
["2024-10-27T02:15", 10, "J"],
["2024-10-27T02:15", 14, "N"],
["2024-10-27T02:30", 11, "K"],
["2024-10-27T02:30", 15, "O"],
["2024-10-27T02:45", 12, "L"],
["2024-10-27T02:45", 16, "P"],
["2024-10-27T03:00", 17, "Q"],
["2024-10-27T03:15", 18, "R"],
["2024-10-27T03:30", 19, "S"],
["2024-10-27T03:45", 20, "T"],
]
df = pd.DataFrame(data, columns=["timestamp", "number", "letter"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
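# Number each repeated timestamp (0 = first occurrence, 1 = second), then sort
# within each hour so that all first occurrences come before the second ones.
df = df.assign(
    cumcount=lambda d: d.groupby("timestamp").cumcount(),
    hour=lambda d: d["timestamp"].dt.hour,
).sort_values(["hour", "cumcount"])
df.set_index("timestamp", inplace=True)
df.index = df.index.tz_localize("Europe/Amsterdam", ambiguous="infer")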
df
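For several days of data, one possible adaptation is to add the calendar date as the leading sort key. The sketch below is only an illustration under stated assumptions: the data is generated rather than taken from the question, `localize_by_sorting`, `localize_with_mask` and the `expected_utc` column are hypothetical names, and each repeated wall-clock time is assumed to appear exactly twice with the summer-time row first. The second function is a different technique from the sorting approach: it leaves the row order alone and passes a boolean array to the `ambiguous` parameter of `tz_localize`, where True marks a DST time.

```python
import pandas as pd

TZ = "Europe/Amsterdam"

# Generated example data (an assumption, not the asker's file): 15-minute
# readings over two local days, including the fall-back night of 2024-10-27.
utc = pd.date_range("2024-10-25 22:00", "2024-10-27 22:45", freq="15min", tz="UTC")
df = pd.DataFrame({
    "timestamp": utc.tz_convert(TZ).tz_localize(None),  # naive local wall time
    "expected_utc": utc,                                # kept only for checking
})
# A stable sort on wall time interleaves the repeated hour as in the question.
df = df.sort_values("timestamp", kind="stable").reset_index(drop=True)
df["number"] = range(len(df))  # serial number reflecting the file order


def localize_by_sorting(df):
    """Reorder per day and hour so ambiguous='infer' sees a single switch."""
    ordered = df.assign(
        date=df["timestamp"].dt.normalize(),
        hour=df["timestamp"].dt.hour,
        cumcount=df.groupby("timestamp").cumcount(),
    ).sort_values(["date", "hour", "cumcount", "timestamp"])
    ordered = ordered.set_index("timestamp")
    ordered.index = ordered.index.tz_localize(TZ, ambiguous="infer")
    # Drop the helper columns and restore the original file order.
    return ordered.drop(columns=["date", "hour", "cumcount"]).sort_values("number")


def localize_with_mask(df):
    """Keep the row order; mark the first occurrence of each timestamp as DST."""
    is_dst = (df.groupby("timestamp").cumcount() == 0).to_numpy()
    out = df.set_index("timestamp")
    # The flag is only consulted for ambiguous times; other rows ignore it.
    out.index = out.index.tz_localize(TZ, ambiguous=is_dst)
    return out


by_sorting = localize_by_sorting(df)
with_mask = localize_with_mask(df)
print(by_sorting.loc["2024-10-27 01:45":"2024-10-27 03:00", ["number"]])
```

Both functions rely on the first of two identical timestamps being the summer-time one, so they would need rethinking if the file could list the winter-time row first. Neither handles the nonexistent hour of the spring transition.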