I want to group by consecutive values in col1
and for each group, select the earliest date and hour from col2
and col3
and the latest date and hour from col4
and col5
.
Here is my data:
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
0 | a | 2021-07-03 | 17:08 | 2021-07-04 | 10:41 |
1 | b | 2021-07-10 | 04:14 | 2021-07-11 | 04:32 |
2 | c | 2021-07-13 | 02:03 | 2021-07-14 | 00:45 |
3 | d | 2021-07-14 | 21:23 | 2021-07-15 | 02:59 |
4 | d | 2021-07-15 | 04:05 | 2021-07-15 | 09:41 |
5 | e | 2021-07-17 | 13:50 | 2021-07-18 | 08:49 |
6 | a | 2021-07-18 | 10:51 | 2021-07-18 | 12:27 |
7 | a | 2021-07-18 | 13:55 | 2021-07-19 | 06:26 |
8 | f | 2021-09-20 | 22:36 | 2021-09-20 | 23:19 |
9 | f | 2021-09-21 | 23:45 | 2021-09-23 | 10:12 |
Expected output:
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
0 | a | 2021-07-03 | 17:08 | 2021-07-04 | 10:41 |
1 | b | 2021-07-10 | 04:14 | 2021-07-11 | 04:32 |
2 | c | 2021-07-13 | 02:03 | 2021-07-14 | 00:45 |
3 | d | 2021-07-14 | 21:23 | 2021-07-15 | 09:41 |
4 | e | 2021-07-17 | 13:50 | 2021-07-18 | 08:49 |
5 | a | 2021-07-18 | 10:51 | 2021-07-19 | 06:26 |
6 | f | 2021-09-20 | 22:36 | 2021-09-23 | 10:12 |
This solution expects the dates to be sorted from the earliest to the latest, as in the provided example data.
from itertools import groupby
from io import StringIO
import pandas as pd
df = pd.read_csv(
StringIO(
"""col1 col2 col3 col4 col5
a 2021-07-03 17:08 2021-07-04 10:41
b 2021-07-10 04:14 2021-07-11 04:32
c 2021-07-13 02:03 2021-07-14 00:45
d 2021-07-14 21:23 2021-07-15 02:59
d 2021-07-15 04:05 2021-07-15 09:41
e 2021-07-17 13:50 2021-07-18 08:49
a 2021-07-18 10:51 2021-07-18 12:27
a 2021-07-18 13:55 2021-07-19 06:26
f 2021-09-20 22:36 2021-09-20 23:19
f 2021-09-21 23:45 2021-09-23 10:12
"""
),
delim_whitespace=True,
header=0,
)
# Group by consecutive values in col1
groups = [list(group) for key, group in groupby(df["col1"].values.tolist())]
group_end_indices = pd.Series(len(g) for g in groups).cumsum()
group_start_indices = (group_end_indices - group_end_indices.diff(1)).fillna(0).astype(int)
filtered_df = []
for start_ix, end_ix in zip(group_start_indices, group_end_indices):
group = df.iloc[start_ix:end_ix]
if group.shape[0] > 1:
group.iloc[0][["col4", "col5"]] = group.iloc[-1][["col4", "col5"]]
filtered_df.append(group.iloc[0])
filtered_df = pd.DataFrame(filtered_df).reset_index(drop=True)
print(filtered_df)
Output:
col1 col2 col3 col4 col5
0 a 2021-07-03 17:08 2021-07-04 10:41
1 b 2021-07-10 04:14 2021-07-11 04:32
2 c 2021-07-13 02:03 2021-07-14 00:45
3 d 2021-07-14 21:23 2021-07-15 09:41
4 e 2021-07-17 13:50 2021-07-18 08:49
5 a 2021-07-18 10:51 2021-07-19 06:26
6 f 2021-09-20 22:36 2021-09-23 10:12