pandas

How to delete undesirable rows in each group based on two conditions?


My input is a dataframe :

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'D', 'D', 'D', 'E', 'E'],
 'col2': ['x1', 'x2', 'x1', 'x1', 'x1', 'x2', 'x3', 'x1', 'x2'],
 'col3': ['', 'mssg1', 'mssg2', '', 'mssg3', '', 'mssg4', '', '']})

  col1 col2   col3
0    A   x1       
1    A   x2  mssg1
2    B   x1  mssg2
3    C   x1       
4    D   x1  mssg3
5    D   x2       
6    D   x3  mssg4
7    E   x1       
8    E   x2       

I need to loop over each group in the column col1 and see if it's a duplicate and if True, go check the column col3 and keep only the non empty values. If all values are empty, then keep them all.

For that I made the code below. at first sight, it works but not only it is slow on my real dataset but I also ignore if it misses something in the logic.

def clean(x):
    if len(x) > 1 and x['col3'].notna().any():
        return x.loc[x['col3'].notna()]
    else:
        return x

final = df.replace('', None).groupby('col1', as_index=False).apply(clean).droplevel(0).fillna('')

My expect output is this :


  col1 col2   col3
1    A   x2  mssg1
2    B   x1  mssg2
3    C   x1       
4    D   x1  mssg3
6    D   x3  mssg4
7    E   x1       
8    E   x2       

Can you guys show me how to solve my problem ?


Solution

  • Use boolean indexing:

    # is col3 not empty?
    m1 = df['col3'].ne('')
    # is m1 not all True per col1?
    m2 = ~m1.groupby(df['col1']).transform('any')
    
    out = df[m1|m2]
    

    Output:

      col1 col2   col3
    1    A   x2  mssg1
    2    B   x1  mssg2
    3    C   x1       
    4    D   x1  mssg3
    6    D   x3  mssg4
    7    E   x1       
    8    E   x2