What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

This is my DataFrame:

import pandas as pd
df = pd.DataFrame({
    'a': [10, 20, 30, 50, 50, 50, 4, 100],
    'b': [30, 3, 200, 25, 24, 31, 29, 2],
    'd': list('aaabbbcc')
})

Expected output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

The grouping is by column d. I want to return the groups that have at least two instances of this mask

m = (df.b.gt(df.a))

This is what I have tried. It works but I wonder if there is a better/more efficient way to do it.

out = df.groupby('d').filter(lambda x: len(x.loc[x.b.gt(x.a)]) >= 2)

Solution

With pandas

You could use a groupby.transform on the mask with sum to produce a boolean Series:

m = df['b'].gt(df['a'])
out = df[m.groupby(df['d']).transform('sum').ge(2)]

Output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

Intermediates:

     a    b  d      m  transform('sum')  ge(2)
0   10   30  a   True                 2   True
1   20    3  a  False                 2   True
2   30  200  a   True                 2   True
3   50   25  b  False                 0  False
4   50   24  b  False                 0  False
5   50   31  b  False                 0  False
6    4   29  c   True                 1  False
7  100    2  c  False                 1  False

Alternative:

counts = m.groupby(df['d']).sum()

out = df[df['d'].isin(counts.index[counts>=2])]

With numpy

Alternatively, one could avoid the costly groupby with pure numpy.

This first approach with add.reduceat requires the groups to be consecutive:

groups = df['d'].ne(df['d'].shift()).values
m = df['b'].gt(df['a']).values
idx = np.nonzero(groups)[0]
out =  df[df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])]

This second one with pandas.factorize and numpy.bincount would work even with shuffled groups:

a, idx = pd.factorize(df['d'])
out = df[df['d'].isin(idx[np.bincount(a, weights=m) >= 2])]

Intermediates:

## reduceat approach
groups = df['d'].ne(df['d'].shift()).values
# array([ True, False, False,  True, False, False,  True, False])
m = df['b'].gt(df['a']).values
# array([ True, False,  True, False, False, False,  True, False])
idx = np.nonzero(groups)[0]
# array([0, 3, 6])
np.add.reduceat(m, idx)>=2
# array([ True, False, False])
idx[np.add.reduceat(m, idx)>=2]
# array([0])
df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]]
# ['a']
df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])
# array([ True,  True,  True, False, False, False, False, False])

## bincount approach
a, idx = pd.factorize(df['d'])
a
# array([0, 0, 0, 1, 1, 1, 2, 2])
idx
# Index(['a', 'b', 'c'], dtype='object')
np.bincount(a, weights=m)
# array([2., 0., 1.])
np.bincount(a, weights=m) >= 2
# array([ True, False, False])
idx[np.bincount(a, weights=m) >= 2]
# Index(['a'], dtype='object')
df['d'].isin(idx[np.bincount(a, weights=m) >= 2])
# array([ True,  True,  True, False, False, False, False, False])

Timings

With groups of 3 rows (sorted members):

With groups of 3 rows (shuffled members; NB. excluding reduceat.):

With a fixed number of 20 groups (of about equal size) with consecutive members:

With a fixed number of 20 groups (of about equal size) with shuffled members (NB. excluding reduceat.):