pythonpandasgroup-bynan

pandas groupby expanding mean does not accept missing values


I've been looking to retrieve group-based expanding means from the following dataset:

df = pd.DataFrame({'id':[1,1,1,2,2,2],'y':[1,2,3,1,2,3]})

and df.groupby('id').expanding().mean().values returns the correct:

array([[1. ],
       [1.5],
       [2. ],
       [1. ],
       [1.5],
       [2. ]])

However, in my specific case I have to deal with some missing values as well, so that:

df2 = pd.DataFrame({'id':[1,1,1,2,2,2],'y':[1,pd.NA,3,1,2,3]})

My expected result applying the same logic would be to ignore the NaN in the computation of the mean, so that from df2.groupby('id').expanding().mean().values I would expect

array([[1. ],
       [1.],
       [2. ],
       [1. ],
       [1.5],
       [2. ]])

Instead, Pandas returns an error due to applying some type assertion to float in the backend. None of my naive attempts (e.g., .expanding().apply(lambda x: np.nansum(x)) are solving this. Any (possibly equally compact) solution?


Solution

  • You can convert column 'y' with pd.to_numeric, which will coerce pd.NaN into nan. The latter can be interpreted correctly by the following operations:

    df2["y"] = pd.to_numeric(df2["y"])
    df2 = df2.groupby("id").expanding().mean().values
    
    [[1. ]
     [1. ]
     [2. ]
     [1. ]
     [1.5]
     [2. ]]