I have the following dataframe...
date_time day value1 value2
0 2023-03-15 00:00:00 3 1 1
1 2023-03-15 06:00:00 3 2 2
2 2023-03-15 12:00:00 3 3 3
3 2023-03-15 18:00:00 3 4 4
4 2023-03-16 00:00:00 4 5 5
5 2023-03-16 06:00:00 4 6 6
6 2023-03-16 12:00:00 4 7 7
7 2023-03-16 18:00:00 4 8 8
8 2023-03-17 00:00:00 5 9 9
9 2023-03-17 06:00:00 5 10 10
10 2023-03-17 12:00:00 5 11 11
11 2023-03-17 18:00:00 5 12 12
12 2023-03-20 06:00:00 1 13 13
I am trying to group the data by day, offset by 7 hours. Using the following...
rows = df.resample('24H', on='date_time', origin='epoch', offset='7H').agg({
'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})
This gives me...
date_time day value1 value2
date_time
2023-03-14 07:00:00 2023-03-15 06:00:00 3.0 1.0 2.0
2023-03-15 07:00:00 2023-03-16 06:00:00 4.0 3.0 6.0
2023-03-16 07:00:00 2023-03-17 06:00:00 5.0 7.0 10.0
2023-03-17 07:00:00 2023-03-17 18:00:00 5.0 11.0 12.0
2023-03-19 07:00:00 2023-03-20 06:00:00 1.0 13.0 13.0
This is almost correct, but what I also want is to group any data that has a gap of more than 24 hours into a single row, like this...
date_time day value1 value2
date_time
2023-03-14 07:00:00 2023-03-15 06:00:00 3.0 1.0 2.0
2023-03-15 07:00:00 2023-03-16 06:00:00 4.0 3.0 6.0
2023-03-16 07:00:00 2023-03-17 06:00:00 5.0 7.0 10.0
2023-03-19 07:00:00 2023-03-20 06:00:00 1.0 11.0 13.0
After I had resampled the data, I tried to group the resampled data, using...
rows = rows.groupby(((rows.date_time - rows.date_time.shift(-1)) < '-24H').cumsum()
).agg({'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})
...but this resulted in...
date_time day value1 value2
date_time
0 2023-03-17 06:00:00 5.0 1.0 10.0
1 2023-03-20 06:00:00 1.0 11.0 13.0
So in a nutshell, I want the first 3 rows from the resample, and the last row from the grouping to achieve what I am after.
I'm close, but can't quite figure it out.
Is there a way to do what I am after?
This is essentially a gap-and-island problem: when the difference is 1 day or less, you create a new island.
agg_dict = {
"date_time": "last",
"day": "last",
"value1": "first",
"value2": "last",
}
rows = df.resample("24H", on="date_time", offset="7H").agg(agg_dict).dropna()
island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
rows = rows.groupby(island).agg(agg_dict)