My DataFrame is:
import pandas as pd
df = pd.DataFrame(
{
'a': [False, True, False, True, False, True, False, True, True, False, False],
}
)
Expected output is forming groups like this:
a
1 True
2 False
3 True
a
5 True
6 False
7 True
a
8 True
9 False
10 False
The logic is:
Basically I want to form groups where df.a == True
and two rows after that. For example, in order to create the first group, the first True
should be found which is row 1
. Then the first group is rows 1, 2 and 3. For the second group the next True
must be found which is not in the first group. That row is row 5
. So the second group is consisted of rows 5, 6 and 7. This image clarifies the point:
And this is my attempt that didn't work:
N = 2
mask = ((df.a.eq(True))
.cummax().cumsum()
.between(1, N+1)
)
out = df[mask]
Since you problem is inherently iterative, you must loop.
The most straightforward option IMO is to use a simple python loop:
def split(df, N=2):
i = 0
a = df['a'].to_numpy()
while i < len(df):
if a[i]:
yield df.iloc[i:i+N+1]
i+=N
i+=1
out = list(split(df))
Output:
[ a
1 True
2 False
3 True,
a
5 True
6 False
7 True,
a
8 True
9 False
10 False]
If you want a simple mask and a unique DataFrame as output, you could improve the speed with numba
:
from numba import jit
@jit(nopython=True)
def get_indices(a, N=2):
i = 0
out = []
while i < len(a):
if a[i]:
out.extend([True]*(N+1))
i += N+1
else:
out.append(False)
i += 1
return out[:len(a)]
out = df.loc[get_indices(df['a'].to_numpy())]
out['group'] = np.arange(len(out))//3
Output:
a group
1 True 0
2 False 0
3 True 0
5 True 1
6 False 1
7 True 1
8 True 2
9 False 2
10 False 2
N = 2
relative timings:
N = 100
NB. @Triky's solution is not providing correct results here
relative timings