This is my DataFrame:
import pandas as pd
df = pd.DataFrame({
'a': [10, 20, 30, 50, 50, 50, 4, 100],
'b': [30, 3, 200, 25, 24, 31, 29, 2],
'd': list('aaabbbcc')
})
Expected output:
a b d
0 10 30 a
1 20 3 a
2 30 200 a
The grouping is by column d
. I want to return the groups that have at least two instances of this mask
m = (df.b.gt(df.a))
This is what I have tried. It works but I wonder if there is a better/more efficient way to do it.
out = df.groupby('d').filter(lambda x: len(x.loc[x.b.gt(x.a)]) >= 2)
You could use a groupby.transform
on the mask with sum
to produce a boolean Series:
m = df['b'].gt(df['a'])
out = df[m.groupby(df['d']).transform('sum').ge(2)]
Output:
a b d
0 10 30 a
1 20 3 a
2 30 200 a
Intermediates:
a b d m transform('sum') ge(2)
0 10 30 a True 2 True
1 20 3 a False 2 True
2 30 200 a True 2 True
3 50 25 b False 0 False
4 50 24 b False 0 False
5 50 31 b False 0 False
6 4 29 c True 1 False
7 100 2 c False 1 False
Alternative:
counts = m.groupby(df['d']).sum()
out = df[df['d'].isin(counts.index[counts>=2])]
Alternatively, one could avoid the costly groupby
with pure numpy.
This first approach with add.reduceat
requires the groups to be consecutive:
groups = df['d'].ne(df['d'].shift()).values
m = df['b'].gt(df['a']).values
idx = np.nonzero(groups)[0]
out = df[df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])]
This second one with pandas.factorize
and numpy.bincount
would work even with shuffled groups:
a, idx = pd.factorize(df['d'])
out = df[df['d'].isin(idx[np.bincount(a, weights=m) >= 2])]
Intermediates:
## reduceat approach
groups = df['d'].ne(df['d'].shift()).values
# array([ True, False, False, True, False, False, True, False])
m = df['b'].gt(df['a']).values
# array([ True, False, True, False, False, False, True, False])
idx = np.nonzero(groups)[0]
# array([0, 3, 6])
np.add.reduceat(m, idx)>=2
# array([ True, False, False])
idx[np.add.reduceat(m, idx)>=2]
# array([0])
df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]]
# ['a']
df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])
# array([ True, True, True, False, False, False, False, False])
## bincount approach
a, idx = pd.factorize(df['d'])
a
# array([0, 0, 0, 1, 1, 1, 2, 2])
idx
# Index(['a', 'b', 'c'], dtype='object')
np.bincount(a, weights=m)
# array([2., 0., 1.])
np.bincount(a, weights=m) >= 2
# array([ True, False, False])
idx[np.bincount(a, weights=m) >= 2]
# Index(['a'], dtype='object')
df['d'].isin(idx[np.bincount(a, weights=m) >= 2])
# array([ True, True, True, False, False, False, False, False])
With groups of 3 rows (sorted members):
With groups of 3 rows (shuffled members; NB. excluding reduceat
.):
With a fixed number of 20 groups (of about equal size) with consecutive members:
With a fixed number of 20 groups (of about equal size) with shuffled members (NB. excluding reduceat
.):