pythonpandasdataframe

How can I form groups by a mask and N rows after that mask?


My DataFrame is:

import pandas as pd
df = pd.DataFrame(
    {
        'a': [False, True, False, True, False, True, False, True, True, False, False],
    }
)

Expected output is forming groups like this:

        a
1    True
2   False
3    True

        a
5    True
6   False
7    True

      a
8    True
9   False
10  False

The logic is:

Basically I want to form groups where df.a == True and two rows after that. For example, in order to create the first group, the first True should be found which is row 1. Then the first group is rows 1, 2 and 3. For the second group the next True must be found which is not in the first group. That row is row 5. So the second group is consisted of rows 5, 6 and 7. This image clarifies the point:

enter image description here

And this is my attempt that didn't work:

N = 2
mask = ((df.a.eq(True))
        .cummax().cumsum()
        .between(1, N+1)
        )

out = df[mask]

Solution

  • Since you problem is inherently iterative, you must loop.

    The most straightforward option IMO is to use a simple python loop:

    def split(df, N=2):
        i = 0
        a = df['a'].to_numpy()
        while i < len(df):
            if a[i]:
                yield df.iloc[i:i+N+1]
                i+=N
            i+=1
    
    out = list(split(df))
    

    Output:

    [       a
     1   True
     2  False
     3   True,
            a
     5   True
     6  False
     7   True,
             a
     8    True
     9   False
     10  False]
    

    If you want a simple mask and a unique DataFrame as output, you could improve the speed with numba:

    from numba import jit
    
    @jit(nopython=True)
    def get_indices(a, N=2):
        i = 0
        out = []
        while i < len(a):
            if a[i]:
                out.extend([True]*(N+1))
                i += N+1
            else:
                out.append(False)
                i += 1
        return out[:len(a)]
    
    out = df.loc[get_indices(df['a'].to_numpy())]
    out['group'] = np.arange(len(out))//3
    

    Output:

            a  group
    1    True      0
    2   False      0
    3    True      0
    5    True      1
    6   False      1
    7    True      1
    8    True      2
    9   False      2
    10  False      2
    

    Timings for N = 2

    enter image description here

    relative timings:

    enter image description here

    Timings for N = 100

    NB. @Triky's solution is not providing correct results here

    enter image description here

    relative timings

    enter image description here