pandasdataframedata-sciencedata-munging

dataframe get rows between certain value to a certain value in previous row


I have a dataframe:

df = c1 c2 c3 code
     1. 2. 3. 200
     1. 5. 7. 220
     1. 2. 3. 200
     2. 4. 1. 340
     6. 1. 1. 370
     6. 1. 5. 270
     9. 8. 2. 300
     1. 6. 9. 700
     9. 2. 1. 200 
     8. 1. 2  400
     1. 2  1. 200
     2. 5. 3  900
     8. 0. 4. 300
     9. 1. 2. 620

I want to take only the rows that are between any row with 300 code to its previous 200 code. So here I will have

df.  c1 c2 c3 code batch_num
     1. 2. 3. 200.   0
     2. 4. 1. 340.   0
     6. 1. 1. 370.   0 
     6. 1. 5. 270.   0
     9. 8. 2. 300.   0
     1. 2  1. 200.   1
     2. 5. 3  900.   1
     8. 0. 4. 300.   1

So basically what I need is to: find each 300, and for each - find the nearest previous 200, and take the rows between them. It is guaranteed that there will always be at least one 200 before each 300. Than, add a columns that indicate the proper batch. How can I do it efficiently in pandas?


Solution

  • You can use:

    # rows after 200 are True
    m1 = df['code'].map({200: True, 300: False}).ffill()
    # rows before 300 are True
    m2 = df['code'].map({300: True, 200: False}).bfill()
    
    # if both conditions True, expand to ± 1
    m = (m1&m2).rolling(3, min_periods=1, center=True).max().astype(bool)
    
    # select
    out = df[m]
    
    # add batch
    out['batch_num'] = out['code'].eq(200).cumsum().sub(1)
    

    output:

         c1   c2   c3  code  batch_num
    2   1.0  2.0  3.0   200          0
    3   2.0  4.0  1.0   340          0
    4   9.0  8.0  2.0   300          0
    8   1.0  2.0  1.0   200          1
    9   2.0  5.0  3.0   900          1
    10  8.0  0.0  4.0   300          1
    

    Intermediates:

         c1   c2   c3  code     m1     m2  m1&m2  rolling_max
    0   1.0  2.0  3.0   200   True  False  False        False
    1   1.0  5.0  7.0   220   True  False  False        False
    2   1.0  2.0  3.0   200   True  False  False         True
    3   2.0  4.0  1.0   340   True   True   True         True
    4   9.0  8.0  2.0   300  False   True  False         True
    5   1.0  6.0  9.0   700  False  False  False        False
    6   9.0  2.0  1.0   200   True  False  False        False
    7   8.0  1.0  2.0   400   True  False  False        False
    8   1.0  2.0  1.0   200   True  False  False         True
    9   2.0  5.0  3.0   900   True   True   True         True
    10  8.0  0.0  4.0   300  False   True  False         True
    11  9.0  1.0  2.0   620  False    NaN  False        False
    

    faster approach without rolling if only one row before/after is needed:

    m = m1&m2
    m = m|m.shift()|m.shift(-1)
    
    out = df[m]
    out['batch_num'] = out['code'].eq(200).cumsum().sub(1)