I have the following dataframe.
       padel  start_time  end_time  duration
38  Padel 10    08:00:00  09:00:00        60
40  Padel 10    10:00:00  11:30:00        90
42  Padel 10    10:30:00  12:00:00        90
44  Padel 10    11:00:00  12:30:00        90
46  Padel 10    11:30:00  13:00:00        90
49  Padel 10    16:00:00  17:30:00        90
51  Padel 10    16:30:00  18:00:00        90
53  Padel 10    17:00:00  18:30:00        90
55  Padel 10    17:30:00  19:00:00        90
57  Padel 10    18:00:00  19:30:00        90
59  Padel 10    18:30:00  20:00:00        90
61  Padel 10    19:00:00  20:30:00        90
63  Padel 10    19:30:00  21:00:00        90
65  Padel 10    20:00:00  21:30:00        90
67  Padel 10    20:30:00  22:00:00        90
I want to merge the overlapping rows into the longest continuous timespans. The output I want should look like this:
       padel  start_time  end_time  duration
38  Padel 10    08:00:00  09:00:00        60
40  Padel 10    10:00:00  13:00:00       180
49  Padel 10    16:00:00  22:00:00       360
I don't care about the duration column; I can compute that myself. But how do I merge the time spans which overlap? Thanks.
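1. Sort by padel and start_time, then use shift() to compare each start_time with the end_time of the row above. A new group starts whenever start_time is greater than that end_time (i.e. the row does not overlap the one before it). The times are zero-padded HH:MM:SS strings, so comparing them as strings gives the same order as comparing them as times.
2. Use fillna with '24:00:00' on the shifted column. shift() leaves NaN in the first row, and filling it means the first comparison is between two time strings. Nothing can be greater than 24 hours for a day, so the first row returns False and lands in group 0.
3. The comparison returns a boolean series of True and False (i.e. 1 and 0, respectively), so you just take the cumulative sum with cumsum. Each True increments the group number.
4. This creates the grp object, which we can include in groupby.

df = df.sort_values(by=['padel', 'start_time'], ascending=[True, True])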
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
df['duration'] = ((pd.to_timedelta(df['end_time']) -
pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)
Out[1]:
      padel  start_time  end_time  duration
0  Padel 10    08:00:00  09:00:00        60
1  Padel 10    10:00:00  13:00:00       180
2  Padel 10    16:00:00  22:00:00       360
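
To make the grouping steps easier to follow, here is a small sketch that prints each intermediate series. The three-row demo frame and the names prev_end, is_new_group and steps are made up for illustration; only the shift / fillna / gt / cumsum chain comes from the answer.

```python
import pandas as pd

# Hypothetical three-row sample: the last two rows overlap each other.
demo = pd.DataFrame({
    'start_time': ['08:00:00', '10:00:00', '10:30:00'],
    'end_time':   ['09:00:00', '11:30:00', '12:00:00'],
})

# end_time of the row above; the first row has no predecessor, so fill it.
prev_end = demo['end_time'].shift().fillna('24:00:00')

# True where a row starts after the previous row has ended (a gap).
is_new_group = demo['start_time'].gt(prev_end)

# Each True bumps the counter, so overlapping rows share one number.
grp = is_new_group.cumsum()

steps = pd.DataFrame({'prev_end': prev_end,
                      'is_new_group': is_new_group,
                      'grp': grp})
print(steps)
```

Rows that share a value in grp are the ones that groupby collapses into a single span.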
Full Code with input dataframe
import pandas as pd

df = pd.DataFrame(pd.DataFrame({'padel': {38: 'Padel 10',
40: 'Padel 10',
42: 'Padel 10',
44: 'Padel 10',
46: 'Padel 10',
49: 'Padel 10',
51: 'Padel 10',
53: 'Padel 10',
55: 'Padel 10',
57: 'Padel 10',
59: 'Padel 10',
61: 'Padel 10',
63: 'Padel 10',
65: 'Padel 10',
67: 'Padel 10'},
'start_time': {38: '08:00:00',
40: '10:00:00',
42: '10:30:00',
44: '11:00:00',
46: '11:30:00',
49: '16:00:00',
51: '16:30:00',
53: '17:00:00',
55: '17:30:00',
57: '18:00:00',
59: '18:30:00',
61: '19:00:00',
63: '19:30:00',
65: '20:00:00',
67: '20:30:00'},
'end_time': {38: '09:00:00',
40: '11:30:00',
42: '12:00:00',
44: '12:30:00',
46: '13:00:00',
49: '17:30:00',
51: '18:00:00',
53: '18:30:00',
55: '19:00:00',
57: '19:30:00',
59: '20:00:00',
61: '20:30:00',
63: '21:00:00',
65: '21:30:00',
67: '22:00:00'},
'duration': {38: 60,
40: 90,
42: 90,
44: 90,
46: 90,
49: 90,
51: 90,
53: 90,
55: 90,
57: 90,
59: 90,
61: 90,
63: 90,
65: 90,
67: 90}}))
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)
df
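
The code above compares each row only with the row directly before it and does not separate courts when shifting. That is enough for the sample data, but it could split a group if a short booking sits inside a longer earlier one, or mix rows when the frame holds several padel values. The following is a hedged sketch of a variant that tracks the running maximum end time per court. The function name merge_overlaps, the temporary grp column and the sample bookings are all assumptions made for illustration, and it still assumes zero-padded HH:MM:SS strings within a single day.

```python
import pandas as pd

def merge_overlaps(df):
    """Merge overlapping spans per court (hypothetical helper)."""
    df = df.sort_values(['padel', 'start_time']).copy()
    start = pd.to_timedelta(df['start_time']).dt.total_seconds()
    end = pd.to_timedelta(df['end_time']).dt.total_seconds()

    # Latest end seen so far among the earlier rows of the same court.
    prev_max_end = end.groupby(df['padel']).transform(
        lambda s: s.cummax().shift())

    # New group: first row of a court (NaN) or a start after every earlier end.
    df['grp'] = (prev_max_end.isna() | (start > prev_max_end)).cumsum()

    out = (df.groupby(['padel', 'grp'], as_index=False)
             .agg(start_time=('start_time', 'min'),
                  end_time=('end_time', 'max'))
             .drop(columns='grp'))

    span = pd.to_timedelta(out['end_time']) - pd.to_timedelta(out['start_time'])
    out['duration'] = (span.dt.total_seconds() // 60).astype(int)
    return out

# Made-up bookings: court A has two short spans nested inside a long one.
sample = pd.DataFrame({
    'padel': ['A', 'A', 'A', 'A', 'B', 'B'],
    'start_time': ['10:00:00', '10:30:00', '12:00:00', '14:00:00',
                   '09:00:00', '09:30:00'],
    'end_time': ['13:00:00', '11:00:00', '12:30:00', '15:00:00',
                 '10:00:00', '10:30:00'],
})
merged = merge_overlaps(sample)
print(merged)
```

Taking min and max instead of first and last keeps the merged span correct when a nested booking ends before the one that contains it, and total_seconds() avoids the day component that .dt.seconds ignores.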