pythonpandasdataframe

What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?


This is my DataFrame:

import pandas as pd
df = pd.DataFrame({
    'a': [10, 20, 30, 50, 50, 50, 4, 100],
    'b': [30, 3, 200, 25, 24, 31, 29, 2],
    'd': list('aaabbbcc')
})

Expected output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

The grouping is by column d. I want to return the groups that have at least two instances of this mask

m = (df.b.gt(df.a))

This is what I have tried. It works but I wonder if there is a better/more efficient way to do it.

out = df.groupby('d').filter(lambda x: len(x.loc[x.b.gt(x.a)]) >= 2)

Solution

  • With

    You could use a groupby.transform on the mask with sum to produce a boolean Series:

    m = df['b'].gt(df['a'])
    out = df[m.groupby(df['d']).transform('sum').ge(2)]
    

    Output:

        a    b  d
    0  10   30  a
    1  20    3  a
    2  30  200  a
    

    Intermediates:

         a    b  d      m  transform('sum')  ge(2)
    0   10   30  a   True                 2   True
    1   20    3  a  False                 2   True
    2   30  200  a   True                 2   True
    3   50   25  b  False                 0  False
    4   50   24  b  False                 0  False
    5   50   31  b  False                 0  False
    6    4   29  c   True                 1  False
    7  100    2  c  False                 1  False
    

    Alternative:

    counts = m.groupby(df['d']).sum()
    
    out = df[df['d'].isin(counts.index[counts>=2])]
    

    With

    Alternatively, one could avoid the costly groupby with pure numpy.

    This first approach with add.reduceat requires the groups to be consecutive:

    groups = df['d'].ne(df['d'].shift()).values
    m = df['b'].gt(df['a']).values
    idx = np.nonzero(groups)[0]
    out =  df[df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])]
    

    This second one with pandas.factorize and numpy.bincount would work even with shuffled groups:

    a, idx = pd.factorize(df['d'])
    out = df[df['d'].isin(idx[np.bincount(a, weights=m) >= 2])]
    

    Intermediates:

    ## reduceat approach
    groups = df['d'].ne(df['d'].shift()).values
    # array([ True, False, False,  True, False, False,  True, False])
    m = df['b'].gt(df['a']).values
    # array([ True, False,  True, False, False, False,  True, False])
    idx = np.nonzero(groups)[0]
    # array([0, 3, 6])
    np.add.reduceat(m, idx)>=2
    # array([ True, False, False])
    idx[np.add.reduceat(m, idx)>=2]
    # array([0])
    df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]]
    # ['a']
    df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])
    # array([ True,  True,  True, False, False, False, False, False])
    
    ## bincount approach
    a, idx = pd.factorize(df['d'])
    a
    # array([0, 0, 0, 1, 1, 1, 2, 2])
    idx
    # Index(['a', 'b', 'c'], dtype='object')
    np.bincount(a, weights=m)
    # array([2., 0., 1.])
    np.bincount(a, weights=m) >= 2
    # array([ True, False, False])
    idx[np.bincount(a, weights=m) >= 2]
    # Index(['a'], dtype='object')
    df['d'].isin(idx[np.bincount(a, weights=m) >= 2])
    # array([ True,  True,  True, False, False, False, False, False])
    

    Timings

    With groups of 3 rows (sorted members):

    pandas numpy filter groups by number of True conditions

    With groups of 3 rows (shuffled members; NB. excluding reduceat.):

    pandas numpy filter groups by number of True conditions

    With a fixed number of 20 groups (of about equal size) with consecutive members:

    pandas numpy filter groups by number of True conditions

    With a fixed number of 20 groups (of about equal size) with shuffled members (NB. excluding reduceat.):

    pandas numpy filter groups by number of True conditions