pythonpython-3.xpandasnumpynumexpr

multiply dataframes based on timestamp intervals overlap


I have two pandas dataframes, each with two columns: a measurement and a timestamp. I need to multiply the first differences of the measurements, but only if there is a time overlap between the two measurement intervals. How can I do this efficiently, as the size of the dataframes gets large? Example:

dfA
      mesA   timeA
0     125    2015-01-14 04:44:49     
1     100    2015-01-14 05:16:23
2     115    2015-01-14 08:57:10     

dfB
      mesB    timeB
0     140     2015-01-14 00:13:17
1     145     2015-01-14 08:52:01
2     120     2015-01-14 11:31:44

Here I would multiply (100-125)*(145-140) since there is a time overlap between the intervals [04:44:49, 05:16:23] and [00:13:17, 08:52:01], but not (100-125)and(120-145), since there isn't one. Similarly, I would have (115-100)*(145-140) but also (115-100)*(120-145), since both have a time overlap.

In the end I will have to sum all the relevant products in a single value, so the result need not be a dataframe. In this case:

s = (100-125)*(145-140)+(115-100)*(145-140)+(115-100)*(120-145) = -425

My current solution:

s = 0
for i in range(1, len(dfA)):
    startA = dfA['timeA'][i-1]
    endA = dfA['timeA'][i]
    for j in range(1, len(dfB)):
        startB = dfB['timeB'][j-1]
        endB = dfB['timeB'][j]
        if (endB>startA) & (startB<endA):
            s+=(dfA['mesA'][i]-dfA['mesA'][i-1])*(dfB['mesB'][j]-dfB['mesB'][j-1])

Although it seems to work, it is very inefficient and becomes impractical with very large datasets. I believe it could be vectorized more efficiently, perhaps using numexpr, but I still haven't found a way.

EDIT: other data

    mesA  timeA
0   125   2015-01-14 05:54:03
1   100   2015-01-14 11:39:53
2   115   2015-01-14 23:58:13
    mesB  timeB
0   110   2015-01-14 10:58:32
1   120   2015-01-14 13:30:00
2   135   2015-01-14 22:29:26

s = 125

Solution

  • Edit: the original answer did not work, so I came up with another version that is not vectorize but they need to be sorted by date.

    arrA = dfA.timeA.to_numpy()
    startA, endA = arrA[0], arrA[1]
    arr_mesA = dfA.mesA.diff().to_numpy()
    mesA = arr_mesA[1]
    
    arrB = dfB.timeB.to_numpy()
    startB, endB = arrB[0], arrB[1]
    arr_mesB = dfB.mesB.diff().to_numpy()
    mesB = arr_mesB[1]
    
    s = 0
    i, j = 1, 1
    imax = len(dfA)-1
    jmax = len(dfB)-1
    while True:
        if (endB>startA) & (startB<endA):
            s+=mesA*mesB
        if (endB>endA) and (i<imax):
            i+=1
            startA, endA, mesA= endA, arrA[i], arr_mesA[i]
        elif j<jmax:
            j+=1
            startB, endB, mesB = endB, arrB[j], arr_mesB[j]
        else:
            break
    

    Original not working answer

    The idea is to great category with pd.cut based on the value in dfB['timeB'] in both dataframes to see where they could overlap. Then calculate the diff in measurements. merge both dataframes on categories and finally multiply and sum the whole thing

    # create bins
    bins_dates = [min(dfB['timeB'].min(), dfA['timeA'].min())-pd.DateOffset(hours=1)]\
                 + dfB['timeB'].tolist()\
                 + [max(dfB['timeB'].max(), dfA['timeA'].max())+pd.DateOffset(hours=1)]
    
    # work on dfB
    dfB['cat'] = pd.cut(dfB['timeB'], bins=bins_dates,
                        labels=range(len(bins_dates)-1), right=False)
    dfB['deltaB'] = -dfB['mesB'].diff(-1).ffill()
    
    # work on dfA
    dfA['cat'] = pd.cut(dfA['timeA'], bins=bins_dates,
                        labels=range(len(bins_dates)-1), right=False)
    # need to calcualte delta for both start and end of intervals
    dfA['deltaAStart'] = -dfA['mesA'].diff(-1)
    dfA['deltaAEnd'] = dfA['mesA'].diff().mask(dfA['cat'].astype(float).diff().eq(0))
    # in the above method, for the end of interval, use a mask to not count twice 
    # intervals that are fully included in one interval of B
    
    # then merge and calcualte the multiplication you are after
    df_ = dfB[['cat', 'deltaB']].merge(dfA[['cat','deltaAStart', 'deltaAEnd']])
    s = (df_['deltaB'].to_numpy()[:,None]*df_[['deltaAStart', 'deltaAEnd']]).sum().sum()
    print (s)
    #-425.0