The easiest way is to demonstrate my question with an example. Suppose I have the following long format data frame
In [284]: import pandas as pd
In [285]: data = pd.DataFrame({"day": [0,0,0,0,0,0,1,1,1,1,1,1], "cat1": ["A", "A", "A", "B", "B", "B", "A", "A", "B", "B", "B", "B"], "cat2":["1", "1", "2", "1", "2", "2", "1", "2", "1", "1", "2", "2"], "value": [10, 230, 32,12, 12, 65, 12, 34, 97, 0, 12,1]})
In [286]: data
Out[286]:
day cat1 cat2 value
0 0 A 1 10
1 0 A 1 230
2 0 A 2 32
3 0 B 1 12
4 0 B 2 12
5 0 B 2 65
6 1 A 1 12
7 1 A 2 34
8 1 B 1 97
9 1 B 1 0
10 1 B 2 12
11 1 B 2 1
Per day I have two categories. My goal is to aggregate the cat2
category in a specific way. For each tuple (date
, cat1
, cat2
) I would like to perform the following:
In [287]: data_day = data[data["day"]==0]
In [288]: data_day_cat1 = data_day[data_day["cat1"]=="A"]
In [289]: data_day_cat1_cat2 = data_day_cat1[data_day_cat1["cat2"]=="1"]
In [290]: data_day_cat1_cat2["value"].pow(2).mean()
Out[290]: np.float64(26500.0)
In [291]: data_day_cat1_cat2 = data_day_cat1[data_day_cat1["cat2"]=="2"]
In [292]: data_day_cat1_cat2["value"].pow(2).mean()
Out[292]: np.float64(1024.0)
That is on the first day, for cat1
being A
, I want a single line for all occurrence of cat2
, where the latter is like a "root mean square error". Currently I'm looping over all combination, but I was playing around with using groupby
. However, something like:
data.groupby(["day", "cat1", "cat2"])["value"].apply(lambda x: x**2).mean()
Does work. What I would like to get is a DataFrame like this:
day cat1 cat2 value
0 0 A 1 26500
1 0 A 2 1024
EDIT: Note, I want the complete DataFrame, was just too lazy to write down the whole data frame.
Is this possible without looping over all day
, cat1
and cat2
? Could groupby be used?
You can create a new column with the square value and then do the groupby:
data["value2"] = data["value"] * data["value"]
gb = data.groupby(["day", "cat1", "cat2"])["value2"].mean()
display(gb)