[SOLVED] Group Pandas DataFrame by Continuous Date Ranges

I have a Pandas DataFrame that looks like this:

import pandas as pd

data = {
    'date': ['2023-01-01 00:00:00', '2023-01-01 06:00:00', '2023-01-01 12:00:00',
             '2023-01-02 00:00:00', '2023-01-02 06:00:00', '2023-01-03 00:00:00'],
    'x': [1, 1, 1, 2, 2, 1],
    'y': ['A', 'A', 'A', 'B', 'B', 'A'],
    'z': [10, 10, 10, 20, 20, 10]
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
print(df)

The DataFrame contains the following columns:

date: Timestamps
x, y, z: Grouping identifiers

I want to group rows by the values in columns x, y, and z. Within each x, y, z combination, consecutive rows should stay in the same group only while the gap between their dates is no more than 6 hours (6H); a larger gap should start a new group. The goal is one date range for each continuous run of rows with matching x, y, z values.

Here is an illustration of the desired output:

x	y	z	date_range
1	A	10	2023-01-01 00:00:00 to 2023-01-01 12:00:00
2	B	20	2023-01-02 00:00:00 to 2023-01-02 06:00:00
1	A	10	2023-01-03 00:00:00 to 2023-01-03 00:00:00

I have tried using groupby with a custom aggregation function and also merge_asof to account for the 6-hour gap, but neither approach worked as expected to create the correct groupings.

How can I achieve this grouping in Pandas?

Solution

Code

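# Flag rows whose gap to the previous row with the same x, y, z exceeds 6 hours
# (assumes df is sorted by date), then count the flags within each x, y, z
# combination to number its continuous runs.
grp = df.groupby(['x', 'y', 'z'])['date'].diff().gt(pd.Timedelta(hours=6)).groupby(
    [df['x'], df['y'], df['z']]
).cumsum()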

# Take the first and last date of each run, then drop the run-number index level.
tmp = df.groupby(
    ['x', 'y', 'z', grp], sort=False
)['date'].agg(['first', 'last']).droplevel(-1)

fmt = '%Y-%m-%d %H:%M:%S'

# Format both ends and join them into a single 'date_range' string column.
out = tmp['first'].dt.strftime(fmt).str.cat(
    tmp['last'].dt.strftime(fmt), sep=' to '
).reset_index(name='date_range')

out:

   x  y   z                                  date_range
0  1  A  10  2023-01-01 00:00:00 to 2023-01-01 12:00:00
1  2  B  20  2023-01-02 00:00:00 to 2023-01-02 06:00:00
2  1  A  10  2023-01-03 00:00:00 to 2023-01-03 00:00:00
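
If the range boundaries are needed as real timestamps rather than one formatted string, the same idea can be wrapped in a small function. This is only a sketch: the helper name date_ranges, its parameters, and the start/end column names are made up for illustration, and the sort_values step is an added assumption (the gap check only makes sense when each x, y, z combination is in chronological order).

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime([
        '2023-01-01 00:00:00', '2023-01-01 06:00:00', '2023-01-01 12:00:00',
        '2023-01-02 00:00:00', '2023-01-02 06:00:00', '2023-01-03 00:00:00',
    ]),
    'x': [1, 1, 1, 2, 2, 1],
    'y': ['A', 'A', 'A', 'B', 'B', 'A'],
    'z': [10, 10, 10, 20, 20, 10],
})


def date_ranges(frame, keys=('x', 'y', 'z'), max_gap=pd.Timedelta(hours=6)):
    """Hypothetical helper: one row per run of rows whose gaps are <= max_gap."""
    keys = list(keys)
    # Added assumption: sort first, since diff() compares each row with the
    # previous row of the same key combination.
    frame = frame.sort_values('date', kind='stable')
    # True where the gap to the previous row of the same key combination
    # exceeds max_gap. The first row of each combination has a NaT diff,
    # which compares as False.
    new_run = frame.groupby(keys)['date'].diff().gt(max_gap)
    # Counting the breaks within each key combination numbers its runs.
    run_id = new_run.groupby([frame[k] for k in keys]).cumsum().rename('run')
    return (
        frame.groupby(keys + [run_id], sort=False)['date']
        .agg(start='min', end='max')
        .reset_index()
        .drop(columns='run')
    )


result = date_ranges(df)
print(result)
```

Keeping start and end as datetime columns leaves the formatting to the caller, for example with dt.strftime as in the code above.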