pythonpandasdiff

calculate diff to previous matching row in a dataframe


I have a series of timestamps (+ other data) that come from 2 separate streams of data ticking at different rates, an example below (NB: the frequency of the real data has some jitter so it's not a simple fixed stride like below)

src,idx,ts
B,1,20
A,1,100
A,2,200
A,3,300
B,2,320
A,4,400
A,5,500
A,6,600
B,3,620

for each A tick, I need to calculate the offset from the preceding B tick so it would become

src,idx,ts
A,1,80
A,2,180
A,3,280
A,4,80
A,5,180
A,6,280

how to do this in pandas without iteration?

I thought of some sort of rolling window but with a dynamic/criteria based window or some hybrid of merge_asof and group by but can't think of a way to do it.


Solution

  • You could group by changing B and subtract the first (B-row) from each group ts. Then maybe filter by not equal B to reproduce your desired final df:

    import pandas as pd
    
    df = pd.DataFrame(
        {"src": ["B", "A", "A", "A", "B", "A", "A", "A", "B"], 
         "idx": [1, 1, 2, 3, 2, 4, 5, 6, 3], 
         "ts": [20, 100, 200, 300, 320, 400, 500, 600, 620]}
    )
    
    df["ts"] -= df.groupby(df.src.eq("B").cumsum())["ts"].transform("first")
    
    df.query("src != 'B'")
    

    More detail:

    df.src.eq("B").cumsum() gives a Series which increases by one each time a "B" is encountered. This is what we want to group the DataFrame into sections between subsequent "B"s. For each group between each B (inclusive) and the following B (exclusive), we subtract the ts value at the first B position from all ts values within the group, hence resetting to zero at each B.