Calculate the duration of overlapping time ranges using pandas

I have large CSV files of traffic data similar to the sample below, for which I need to calculate the total bytes and the duration of each data transfer. The time ranges can overlap, so overlapping ranges must be merged before the duration is computed:

first_packet_ts  last_packet_ts  bytes_uplink  bytes_downlink  service   user_id
1441901695012    1441901696009   165           1212            facebook  3
1441901695500    1441901696212   23            4321            facebook  3
1441901698000    1441901698010   242           3423            youtube   4
1441901698400    1441901698500   423           2344            youtube   4

Desired output:

duration  bytes_uplink  bytes_downlink  service   user_id
1200      188           5533            facebook  3
110       665           5767            youtube   4
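To spell out how those durations come about, here is a minimal pure-Python sketch that merges overlapping (start, stop) pairs and sums the merged lengths. The helper name `merged_duration` is made up for illustration; it is only a reference for checking results on small inputs, not a fast solution.

```python
def merged_duration(intervals):
    """Total length covered by a list of (start, stop) pairs (hypothetical helper)."""
    total = 0
    current_start = current_stop = None
    for start, stop in sorted(intervals):
        if current_stop is None or start > current_stop:
            # Gap before this interval: close the previous merged range, if any.
            if current_stop is not None:
                total += current_stop - current_start
            current_start, current_stop = start, stop
        else:
            # Overlap: extend the merged range if this interval ends later.
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        total += current_stop - current_start
    return total


facebook = [(1441901695012, 1441901696009), (1441901695500, 1441901696212)]
youtube = [(1441901698000, 1441901698010), (1441901698400, 1441901698500)]
print(merged_duration(facebook))  # 1200: the two ranges overlap and merge into one
print(merged_duration(youtube))   # 110: two separate ranges, 10 + 100
```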

I currently use something like the following lines:

import pandas as pd

df = pd.read_csv(input_file_path)
grouped = df.groupby(['service', 'user_id'])
durations = grouped.apply(calculate_duration)
df = grouped[['bytes_uplink', 'bytes_downlink']].sum()
df = df.reset_index()

The calculate_duration function (below) iterates over the rows of each group, merges the overlapping time intervals and returns the total duration for that group. The resulting durations are then joined to the summed dataframe df.

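def calculate_duration(group):
    # Assumes the rows of the group are already sorted by first_packet_ts.
    # iter() makes it explicit that both loops share a single iterator:
    # the outer loop takes the first interval, the inner loop consumes the rest.
    ranges = iter(group[['first_packet_ts', 'last_packet_ts']].itertuples(index=False))
    duration = 0
    for current_start, current_stop in ranges:
        for start, stop in ranges:
            if start > current_stop:
                # gap found: close the current merged interval and start a new one
                duration += current_stop - current_start
                current_start, current_stop = start, stop
            else:
                # overlap: extend the current merged interval
                current_stop = max(current_stop, stop)
        duration += current_stop - current_start
    return duration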

This approach is very slow as it involves iteration and invoking the apply method for each group.

Is there a more efficient way to calculate the duration of the data transfer, merging the overlapping intervals, using pandas (ideally avoiding explicit iteration) and preferably without resorting to Cython?

Solution

How about this? (Having timed it, it might be a bit slower...)

pd.pivot_table(df, columns='user_id', index='service',
               values=['bytes_uplink', 'bytes_downlink'], aggfunc='sum')

Edit: I don't think this is any more valid than yours but you could try something along these lines:

import pandas as pd

# create dummy start/end dataframe
df = pd.DataFrame({'end': [50, 100, 120, 150], 'start': [30, 0, 40, 130]})
df = df[['start', 'end']]
df = df.sort_values('start')  # DataFrame.sort was removed; use sort_values

# latest end time seen in any earlier row (NaN for the first row)
df['roll_end'] = df['end'].cummax().shift()

# if start is before rolling max end time then reset start to rolling max end time
overlap = df['start'] < df['roll_end']
df['new_start'] = df['start'].mask(overlap, df['roll_end'])

# if the new start is after end, the row is completely overlapped and adds nothing
print((df['end'] - df['new_start']).clip(lower=0).sum())
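
The same rolling-max idea can be applied per (service, user_id) group to the data in the question. What follows is only a sketch under a few assumptions: the function name `merged_durations` and the helper columns `run_end`, `prev_end` and `covered` are made up for illustration, the sample frame is typed in by hand instead of read from the CSV, and named aggregation needs a reasonably recent pandas. I have not timed it against the original apply-based version.

```python
import pandas as pd


def merged_durations(df):
    """Per-group merged duration plus byte totals (illustrative sketch)."""
    keys = ['service', 'user_id']
    df = df.sort_values(keys + ['first_packet_ts']).copy()

    # Running maximum of the end time within each group...
    df['run_end'] = df.groupby(keys)['last_packet_ts'].cummax()
    # ...shifted by one row, so each row sees the latest end among earlier rows.
    df['prev_end'] = df.groupby(keys)['run_end'].shift()

    # Effective start is max(start, prev_end); max(axis=1) skips the NaN
    # that shift() leaves on the first row of every group.
    effective_start = df[['first_packet_ts', 'prev_end']].max(axis=1)

    # Rows fully covered by earlier rows would go negative, so clip at zero.
    df['covered'] = (df['last_packet_ts'] - effective_start).clip(lower=0)

    result = df.groupby(keys).agg(
        duration=('covered', 'sum'),
        bytes_uplink=('bytes_uplink', 'sum'),
        bytes_downlink=('bytes_downlink', 'sum'),
    ).reset_index()
    result['duration'] = result['duration'].astype('int64')
    return result


# Sample rows from the question, entered by hand.
sample = pd.DataFrame({
    'first_packet_ts': [1441901695012, 1441901695500, 1441901698000, 1441901698400],
    'last_packet_ts': [1441901696009, 1441901696212, 1441901698010, 1441901698500],
    'bytes_uplink': [165, 23, 242, 423],
    'bytes_downlink': [1212, 4321, 3423, 2344],
    'service': ['facebook', 'facebook', 'youtube', 'youtube'],
    'user_id': [3, 3, 4, 4],
})
result = merged_durations(sample)
print(result)
```

On the sample rows this gives durations of 1200 and 110, matching the desired output in the question. All the work is done by grouped cummax, shift and sum, so no Python-level function is called per group.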