python pandas datetime timespan relative-time-span

Python: how to merge overlapping time spans into a bigger one


I have the following dataframe.

       padel start_time  end_time  duration
38  Padel 10   08:00:00  09:00:00        60
40  Padel 10   10:00:00  11:30:00        90
42  Padel 10   10:30:00  12:00:00        90
44  Padel 10   11:00:00  12:30:00        90
46  Padel 10   11:30:00  13:00:00        90
49  Padel 10   16:00:00  17:30:00        90
51  Padel 10   16:30:00  18:00:00        90
53  Padel 10   17:00:00  18:30:00        90
55  Padel 10   17:30:00  19:00:00        90
57  Padel 10   18:00:00  19:30:00        90
59  Padel 10   18:30:00  20:00:00        90
61  Padel 10   19:00:00  20:30:00        90
63  Padel 10   19:30:00  21:00:00        90
65  Padel 10   20:00:00  21:30:00        90
67  Padel 10   20:30:00  22:00:00        90

I want to merge the overlapping time spans and keep the longest continuous ones. The output I want should look like this:

       padel start_time  end_time  duration
38  Padel 10   08:00:00  09:00:00        60
40  Padel 10   10:00:00  13:00:00       180
49  Padel 10   16:00:00  22:00:00       360

I don't care about the duration; I can compute that myself. But how do I merge the time spans that overlap? Thanks.


Solution

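    1. After sorting by start_time, use shift() to compare each row's start_time with the end_time of the row above. If start_time is greater, the row does not overlap the previous one and starts a new group; otherwise it overlaps (or touches) and stays in the current group.
    2. shift() leaves NaN in the first row, because there is no row above it. We fillna with '24:00:00' so the first comparison is made against a valid time string; no start_time within a day is greater than 24 hours, so the first row compares as False.
    3. The comparison returns a boolean Series of True and False (i.e. 1 and 0, respectively), so its cumulative sum, taken with cumsum(), goes up by one at every row that starts a new group.
    4. This creates a Series of group labels, grp, which we can pass to groupby together with 'padel'.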
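
The steps above can be traced on a few rows of the question's data. This is only an illustrative sketch; the names prev_end and new_block are introduced here for readability and are not part of the original answer.

```python
import pandas as pd

# Five rows from the question: one isolated slot, a chain of overlapping
# slots, and the first slot of the evening chain.
df = pd.DataFrame({
    'start_time': ['08:00:00', '10:00:00', '10:30:00', '11:00:00', '16:00:00'],
    'end_time':   ['09:00:00', '11:30:00', '12:00:00', '12:30:00', '17:30:00'],
})

# End time of the row above; the first row has no predecessor, so fill it.
prev_end = df['end_time'].shift().fillna('24:00:00')

# True where a row starts after the previous row ended, i.e. a new block begins.
# Zero-padded HH:MM:SS strings compare in time order as plain strings.
new_block = df['start_time'].gt(prev_end)

# The running count of block starts gives one label per merged span.
grp = new_block.cumsum()

print(new_block.tolist())  # [False, True, False, False, True]
print(grp.tolist())        # [0, 1, 1, 1, 2]
```

Rows that share a label in grp end up in the same merged span.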

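    import pandas as pd
    # Sort so that each row is compared with the slot that starts just before it.
    df = df.sort_values(by=['padel', 'start_time'])
    # A row starts a new group when it begins after the previous row has ended.
    grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
    # Collapse each group to its first start and its last end.
    df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
    # Duration in minutes; total_seconds() also stays correct for spans of a day or more.
    df['duration'] = ((pd.to_timedelta(df['end_time']) -
                       pd.to_timedelta(df['start_time'])).dt.total_seconds() / 60).astype(int)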
    Out[1]: 
          padel start_time  end_time  duration
    0  Padel 10   08:00:00  09:00:00        60
    1  Padel 10   10:00:00  13:00:00       180
    2  Padel 10   16:00:00  22:00:00       360
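
The comparison above looks only at the row directly above, so it relies on end times never decreasing as start times increase, which holds for the sample data. Below is a minimal sketch of a variant that also tolerates a short slot nested inside a longer one and keeps several courts apart. It assumes zero-padded HH:MM:SS strings as in the question. The function merge_spans and the columns start_min, end_min, run_end and block are hypothetical names, not part of the original answer.

```python
import pandas as pd

def merge_spans(df):
    """Hypothetical helper: merge overlapping or touching spans per court."""
    out = df.copy()
    # Minutes since midnight, so the comparisons are numeric.
    out['start_min'] = pd.to_timedelta(out['start_time']).dt.total_seconds() / 60
    out['end_min'] = pd.to_timedelta(out['end_time']).dt.total_seconds() / 60
    out = out.sort_values(['padel', 'start_min', 'end_min'])

    # Latest end seen so far within each court, shifted down one row so that
    # each row is compared with everything that came before it.
    out['run_end'] = out.groupby('padel')['end_min'].cummax()
    prev_end = out.groupby('padel')['run_end'].shift()

    # A new block starts where a row begins after all earlier rows have ended.
    # The first row of each court compares against NaN, which gives False;
    # grouping by 'padel' as well keeps the courts apart regardless.
    out['block'] = (out['start_min'] > prev_end).cumsum()

    merged = (out.groupby(['padel', 'block'], as_index=False)
                 .agg(start_time=('start_time', 'first'),
                      end_time=('end_time', 'max'),  # string max; assumes zero padding
                      start_min=('start_min', 'first'),
                      end_min=('end_min', 'max')))
    merged['duration'] = (merged['end_min'] - merged['start_min']).astype(int)
    return merged[['padel', 'start_time', 'end_time', 'duration']]


# Made-up rows for illustration: the 10:30-11:00 slot sits inside 10:00-13:00.
demo = pd.DataFrame({
    'padel': ['Padel 10', 'Padel 10', 'Padel 10', 'Padel 11'],
    'start_time': ['10:00:00', '10:30:00', '12:00:00', '10:00:00'],
    'end_time':   ['13:00:00', '11:00:00', '14:00:00', '11:00:00'],
})
print(merge_spans(demo))
```

On the demo rows, 'Padel 10' collapses to a single 10:00:00 to 14:00:00 span of 240 minutes and 'Padel 11' keeps its 60-minute slot. The previous-row comparison would split 'Padel 10' at 12:00:00 because the nested slot ends at 11:00:00.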
    

    Full code with the input DataFrame:

    import pandas as pd

    df = pd.DataFrame(pd.DataFrame({'padel': {38: 'Padel 10',
      40: 'Padel 10',
      42: 'Padel 10',
      44: 'Padel 10',
      46: 'Padel 10',
      49: 'Padel 10',
      51: 'Padel 10',
      53: 'Padel 10',
      55: 'Padel 10',
      57: 'Padel 10',
      59: 'Padel 10',
      61: 'Padel 10',
      63: 'Padel 10',
      65: 'Padel 10',
      67: 'Padel 10'},
     'start_time': {38: '08:00:00',
      40: '10:00:00',
      42: '10:30:00',
      44: '11:00:00',
      46: '11:30:00',
      49: '16:00:00',
      51: '16:30:00',
      53: '17:00:00',
      55: '17:30:00',
      57: '18:00:00',
      59: '18:30:00',
      61: '19:00:00',
      63: '19:30:00',
      65: '20:00:00',
      67: '20:30:00'},
     'end_time': {38: '09:00:00',
      40: '11:30:00',
      42: '12:00:00',
      44: '12:30:00',
      46: '13:00:00',
      49: '17:30:00',
      51: '18:00:00',
      53: '18:30:00',
      55: '19:00:00',
      57: '19:30:00',
      59: '20:00:00',
      61: '20:30:00',
      63: '21:00:00',
      65: '21:30:00',
      67: '22:00:00'},
     'duration': {38: 60,
      40: 90,
      42: 90,
      44: 90,
      46: 90,
      49: 90,
      51: 90,
      53: 90,
      55: 90,
      57: 90,
      59: 90,
      61: 90,
      63: 90,
      65: 90,
      67: 90}}))
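    # Sort, label each run of overlapping rows, then collapse every run to one row.
    df = df.sort_values(by=['padel', 'start_time'])
    grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
    df = df.groupby([grp, 'padel'], as_index=False).agg({'start_time':'first', 'end_time':'last'})
    # Duration in minutes.
    df['duration'] = ((pd.to_timedelta(df['end_time']) -
                       pd.to_timedelta(df['start_time'])).dt.total_seconds() / 60).astype(int)
    df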