pandasconditional-statementsgroupingresample

Pandas - group rows on date condition


I have the following dataframe...

             date_time  day  value1  value2
0  2023-03-15 00:00:00    3       1       1
1  2023-03-15 06:00:00    3       2       2
2  2023-03-15 12:00:00    3       3       3
3  2023-03-15 18:00:00    3       4       4
4  2023-03-16 00:00:00    4       5       5
5  2023-03-16 06:00:00    4       6       6
6  2023-03-16 12:00:00    4       7       7
7  2023-03-16 18:00:00    4       8       8
8  2023-03-17 00:00:00    5       9       9
9  2023-03-17 06:00:00    5      10      10
10 2023-03-17 12:00:00    5      11      11
11 2023-03-17 18:00:00    5      12      12
12 2023-03-20 06:00:00    1      13      13

I am trying to group the data by day, offset by 7 hours. Using the following...

rows = df.resample('24H', on='date_time', origin='epoch', offset='7H').agg({
    'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})

This gives me...

                              date_time  day  value1  value2
date_time                                                   
2023-03-14 07:00:00 2023-03-15 06:00:00  3.0     1.0     2.0
2023-03-15 07:00:00 2023-03-16 06:00:00  4.0     3.0     6.0
2023-03-16 07:00:00 2023-03-17 06:00:00  5.0     7.0    10.0
2023-03-17 07:00:00 2023-03-17 18:00:00  5.0    11.0    12.0
2023-03-19 07:00:00 2023-03-20 06:00:00  1.0    13.0    13.0

This is almost correct, but what I also want is to group any data that has a gap of more than 24 hours into a single row, like this...

                              date_time  day  value1  value2
date_time                                                   
2023-03-14 07:00:00 2023-03-15 06:00:00  3.0     1.0     2.0
2023-03-15 07:00:00 2023-03-16 06:00:00  4.0     3.0     6.0
2023-03-16 07:00:00 2023-03-17 06:00:00  5.0     7.0    10.0
2023-03-19 07:00:00 2023-03-20 06:00:00  1.0    11.0    13.0

After I had resampled the data, I tried to group the resampled data, using...

rows = rows.groupby(((rows.date_time - rows.date_time.shift(-1)) < '-24H').cumsum()
       ).agg({'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})

...but this resulted in...

                    date_time  day  value1  value2
date_time                                         
0         2023-03-17 06:00:00  5.0     1.0    10.0
1         2023-03-20 06:00:00  1.0    11.0    13.0

So in a nutshell, I want the first 3 rows from the resample, and the last row from the grouping to achieve what I am after.

I'm close, but can't quite figure it out.

Is there a way to do what I am after?


Solution

  • This is essentially a gap-and-island problem: when the difference is 1 day or less, you create a new island.

    agg_dict = {
        "date_time": "last",
        "day": "last",
        "value1": "first",
        "value2": "last",
    }
    rows = df.resample("24H", on="date_time", offset="7H").agg(agg_dict).dropna()
    
    island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
    rows = rows.groupby(island).agg(agg_dict)