python, pandas, pandas-groupby

Replacing values with groupby means


I have a DataFrame with a column that contains some bad data in the form of negative values. I would like to replace each value < 0 with the mean of the group it belongs to.

For missing values (NaN), I would do:

data = df.groupby(['GroupID']).column
data.transform(lambda x: x.fillna(x.mean()))

But how can I do this for a condition such as x < 0?

Thanks!


Solution

  • Using @AndyHayden's example, you could use groupby/transform with a custom replace function:

    import pandas as pd
    
    df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
    print(df)
    #    a  b
    # 0  1  1
    # 1  1 -1
    # 2  2  1
    # 3  2  2
    
    data = df.groupby(['a'])
    
    def replace(group):
        # Work on a copy so the original data is not modified in place.
        group = group.copy()
        mask = group < 0
        # Select the values that are < 0 and replace them with
        # the mean of the values in the same group that are not < 0.
        group[mask] = group[~mask].mean()
        return group
    
    print(data.transform(replace))
    #    b
    # 0  1
    # 1  1
    # 2  1
    # 3  2
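
The same idea can also be sketched without a custom function by reusing the fillna idiom from the question: turn the negative values into NaN first, then fill them with the per-group mean. This is only a sketch on the example frame above; the names valid, group_mean and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [1, -1], [2, 1], [2, 2]], columns=list("ab"))

# Keep non-negative values and turn negatives into NaN (column becomes float).
valid = df["b"].where(df["b"] >= 0)

# Per-group mean of the remaining values; NaN is skipped by mean().
group_mean = valid.groupby(df["a"]).transform("mean")

# Fill the NaN slots (the former negatives) with their group's mean.
result = valid.fillna(group_mean)
print(result.tolist())
# [1.0, 1.0, 1.0, 2.0]
```

Note that the result is a float column, and a group whose values are all negative has no valid mean, so its entries would stay NaN.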