I recently asked a question regarding grouping pandas rows on a date condition, and got an excellent answer see here
However, it got me thinking that if this functionality was placed into a common function, then there is a scenario where this would not work.
So, using the same data from the previous question...
date_time day value1 value2
0 2023-03-15 00:00:00 3 1 1
1 2023-03-15 06:00:00 3 2 2
2 2023-03-15 12:00:00 3 3 3
3 2023-03-15 18:00:00 3 4 4
4 2023-03-16 00:00:00 4 5 5
5 2023-03-16 06:00:00 4 6 6
6 2023-03-16 12:00:00 4 7 7
7 2023-03-16 18:00:00 4 8 8
8 2023-03-17 00:00:00 5 9 9
9 2023-03-17 06:00:00 5 10 10
10 2023-03-17 12:00:00 5 11 11
11 2023-03-17 18:00:00 5 12 12
12 2023-03-20 06:00:00 1 13 13
If I perform the same resampling, except without an offset...
agg_dict = {
"date_time": "last",
"day": "last",
"value1": "first",
"value2": "last",
}
rows = df.resample("24H", on="date_time").agg(agg_dict).dropna()
This gives me...
date_time day value1 value2
date_time
2023-03-15 2023-03-15 18:00:00 3.0 1.0 4.0
2023-03-16 2023-03-16 18:00:00 4.0 5.0 8.0
2023-03-17 2023-03-17 18:00:00 5.0 9.0 12.0
2023-03-20 2023-03-20 06:00:00 1.0 13.0 13.0
...which is correct, as all the data is grouped into the relevant days.
But if I then run the following commands on it (which is required when there is an offset), then the data would no longer be correct...
island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
rows = rows.groupby(island).agg(agg_dict)
...the data from the last 2 rows (Friday/Monday) get grouped into a single row...
date_time day value1 value2
island
0 2023-03-15 18:00:00 3.0 1.0 4.0
1 2023-03-16 18:00:00 4.0 5.0 8.0
2 2023-03-20 06:00:00 1.0 9.0 13.0
I could perform some if logic to either perform the last 2 commands or not, but that doesn't seem a very elegant way of doing it. It would be good if both scenario's returned the correct results from the same test(s).
In a nutshell, I am trying to group data into 24H blocks. Any data that crosses a day boundary should be joined to the adjacent days data to complete the 24H block (including weekends).
So basically, if a 24H block of data roles over into a new day, it was only a problem when it crossed a gap, i.e. weekend, which has now beened solved, but any data that does not cross a day boundary, i.e. the data starts and ends on a Friday, should not be joined with Mondays data.
So if the data crosses a gap, I need to use the gap-n-island solution, but if the data doesn't cross a gap, then the data should be left as-is.
Friday Monday
| || |
XXXXXXXXXXXXXXXXXXXXXXXX - gap crossed - join data
XXXXXXXXXXXXXXXXXXXXXXXX - gap not crossed - do nothing
Could (and is there) an additional test that can be added into the gap-n-island solution code somehow to allow both scenario's to work?
I think I finally figure out you logic: after the resample, if the date_time
column goes into the next day then group that row with the next day.
You can tweak the island logic:
# if a row's date_time is on or before next row's date_time, give
# the row its own island. Else include it in next row's island.
island = (
rows["date_time"]
.le(rows.index.to_series().shift(-1, fill_value=pd.Timestamp.max))
.cumsum()
.rename("island")
)
rows.groupby(island).agg(agg_dict)