Tags: python, pandas, group-by, mean

Mean value with special dependency


I have a DataFrame that looks something like this:

C1 C2
10 10
20 10
30 16
5 23
6 23
8 10
4 10
2 10

I would like to calculate the mean of column C1 depending on the values in column C2: the mean is taken over each run of consecutive identical values in C2, i.e. over all rows of C1 until the value in C2 changes again. The result table should look like this:

C1 C2
15 10
30 16
5.5 23
4.67 10
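For anyone who wants to reproduce this, the sample frame can be built from the input table above (column names taken from the question):

```python
import pandas as pd

# Sample data copied from the question's input table
df = pd.DataFrame({
    'C1': [10, 20, 30, 5, 6, 8, 4, 2],
    'C2': [10, 10, 16, 23, 23, 10, 10, 10],
})
print(df)
```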

Solution

  • Use GroupBy.mean with groups of consecutive values, built by comparing each value of C2 with its Series.shift-ed neighbour and accumulating the changes with Series.cumsum; finally remove the helper first level with droplevel and restore the original column order with DataFrame.reindex:

    out = (df.groupby([df['C2'].ne(df['C2'].shift()).cumsum(), 'C2'], sort=False)['C1']
             .mean()
             .droplevel(0)
             .reset_index()
             .reindex(df.columns, axis=1))
    print(out)
              C1  C2
    0  15.000000  10
    1  30.000000  16
    2   5.500000  23
    3   4.666667  10
    

    How it works:

    print(df.assign(compared=df['C2'].ne(df['C2'].shift()),
                    cumsum=df['C2'].ne(df['C2'].shift()).cumsum()))
       C1  C2  compared  cumsum
    0  10  10      True       1
    1  20  10     False       1
    2  30  16      True       2
    3   5  23      True       3
    4   6  23     False       3
    5   8  10      True       4
    6   4  10     False       4
    7   2  10     False       4
    
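    The same run-based grouping can be sketched without pandas, using itertools.groupby on consecutive C2 values, as a quick cross-check of the expected means (data hard-coded from the question):

```python
from itertools import groupby

c1 = [10, 20, 30, 5, 6, 8, 4, 2]
c2 = [10, 10, 16, 23, 23, 10, 10, 10]

# groupby collapses consecutive runs of equal C2 values;
# for each run we average the matching C1 values.
means = []
for key, run in groupby(zip(c2, c1), key=lambda pair: pair[0]):
    vals = [v for _, v in run]
    means.append((sum(vals) / len(vals), key))

print(means)  # [(15.0, 10), (30.0, 16), (5.5, 23), (4.666..., 10)]
```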

    Thanks to @ouroboros1 for an alternative, simpler solution with GroupBy.agg:

    out = (df.groupby(df['C2'].ne(df['C2'].shift()).cumsum(), as_index=False)
             .agg({'C1': 'mean', 'C2': 'first'}))
    print(out)
              C1  C2
    0  15.000000  10
    1  30.000000  16
    2   5.500000  23
    3   4.666667  10
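    If C1 should match the two-decimal figures shown in the expected table rather than full floats, one option is to round the aggregated column with Series.round (sketch, using the same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({'C1': [10, 20, 30, 5, 6, 8, 4, 2],
                   'C2': [10, 10, 16, 23, 23, 10, 10, 10]})

out = (df.groupby(df['C2'].ne(df['C2'].shift()).cumsum(), as_index=False)
         .agg({'C1': 'mean', 'C2': 'first'}))

# Round only for presentation; keep the unrounded values for further math.
out['C1'] = out['C1'].round(2)
print(out)
```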