pythonpandasdataframegroup-bypandas-apply

How do I keep values based on dataframe values?


I have the following dataframe.

ID   path1   path2   path3
1    12      NaN      NaN
1    1       5        NaN
1    2       NaN       ''
1    2       4        111
2    123     NaN      NaN
3    11      25       NaN
3    1       NaN      NaN 
3    21      34       NaN
3    NaN     NaN      NaN

I want to update column values based on ID if there are NaN values. first priority is path1, path2 and path3 have values then keep. check which column has more results keep that column.

What is the best way to get the result of the dataframe below?

ID   path1   path2   path3
1    2       4       111
2    123     NaN     NaN  
3    11      25      NaN
3    21      34      NaN

Solution

  • Group the dataframe by ID then get all the records which have maximum number of non NaN values in the rows by applying a function to the group.

    >>> (df.groupby('ID')
        .apply(lambda x: x.loc[x.notna().sum(axis=1).max() == x.notna().sum(axis=1)])
        .reset_index(drop=True))
    
       ID  path1  path2 path3
    0   1    2.0    4.0   111
    1   2  123.0    NaN   NaN
    2   3   11.0   25.0   NaN
    3   3   21.0   34.0   NaN
    

    Here is slightly modified version as above code creates mask and calls sum twice, it can be avoided by use of normal function than a lambda function:

    def get_rows(df):
        counts = df.notna().sum(axis=1)
        return df.loc[counts==counts.max()]
    
    
    df.groupby('ID').apply(get_rows).reset_index(drop=True)
       ID  path1  path2 path3
    0   1    2.0    4.0   111
    1   2  123.0    NaN   NaN
    2   3   11.0   25.0   NaN
    3   3   21.0   34.0   NaN