pythonpandas

How to do an advanced grouping in pandas?


The easiest way is to demonstrate my question with an example. Suppose I have the following long format data frame

In [284]: import pandas as pd

In [285]: data = pd.DataFrame({"day": [0,0,0,0,0,0,1,1,1,1,1,1], "cat1": ["A", "A", "A", "B", "B", "B", "A", "A", "B", "B", "B", "B"], "cat2":["1", "1", "2", "1", "2", "2", "1", "2", "1", "1", "2", "2"], "value": [10, 230, 32,12, 12, 65, 12, 34, 97, 0, 12,1]})

In [286]: data
Out[286]: 
    day cat1 cat2  value
0     0    A    1     10
1     0    A    1    230
2     0    A    2     32
3     0    B    1     12
4     0    B    2     12
5     0    B    2     65
6     1    A    1     12
7     1    A    2     34
8     1    B    1     97
9     1    B    1      0
10    1    B    2     12
11    1    B    2      1

Per day I have two categories. My goal is to aggregate the cat2 category in a specific way. For each tuple (date, cat1, cat2) I would like to perform the following:

In [287]: data_day = data[data["day"]==0]

In [288]: data_day_cat1 = data_day[data_day["cat1"]=="A"]

In [289]: data_day_cat1_cat2 = data_day_cat1[data_day_cat1["cat2"]=="1"]

In [290]: data_day_cat1_cat2["value"].pow(2).mean()
Out[290]: np.float64(26500.0)

In [291]: data_day_cat1_cat2 = data_day_cat1[data_day_cat1["cat2"]=="2"]

In [292]: data_day_cat1_cat2["value"].pow(2).mean()
Out[292]: np.float64(1024.0)

That is on the first day, for cat1 being A, I want a single line for all occurrence of cat2, where the latter is like a "root mean square error". Currently I'm looping over all combination, but I was playing around with using groupby. However, something like:

data.groupby(["day", "cat1", "cat2"])["value"].apply(lambda x: x**2).mean()

Does work. What I would like to get is a DataFrame like this:

    day cat1 cat2  value
0     0    A    1  26500
1     0    A    2   1024

EDIT: Note, I want the complete DataFrame, was just too lazy to write down the whole data frame.

Is this possible without looping over all day, cat1 and cat2? Could groupby be used?


Solution

  • You can create a new column with the square value and then do the groupby:

    data["value2"] = data["value"] * data["value"]  
    gb = data.groupby(["day", "cat1", "cat2"])["value2"].mean()
    display(gb)
    

    res