# Julia Groupby with mean calculation

I have this dataframe:

``````using DataFrames, Statistics  # Statistics provides `mean`

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
``````

and I want to do a groupby. In Python I would do:

``````d.groupby('class')[['num','last']].mean()
``````

How can I do the same in Julia?

I have been trying to use `combine` and `groupby`, but with no success so far.

Update: I managed to do it this way:

``````gd = groupby(d, :class)
combine(gd, :num => mean, :last => mean)
``````

Is there any better way to do it?

Solution

• It depends on what you mean by "a better way". You can apply the same function to multiple columns like this:

``````combine(gd, [:num, :last] .=> mean)
``````

or, if you had a lot of columns and e.g. wanted to apply `mean` to all columns except the grouping column, you could do:

``````combine(gd, Not(:class) .=> mean)
``````

or (if you want to avoid having to remember which column was used for grouping):

``````combine(gd, valuecols(gd) .=> mean)
``````
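As a quick sanity check, the three selectors above should give the same result on the data from the question. This is only a minimal sketch: it assumes DataFrames.jl and the Statistics standard library are available, and the names `r1`, `r2`, `r3` are purely illustrative.

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
gd = groupby(d, :class)

# valuecols returns the non-grouping columns of the grouped data frame
cols = valuecols(gd)

r1 = combine(gd, [:num, :last] .=> mean)   # explicit column list
r2 = combine(gd, Not(:class) .=> mean)     # everything except the grouping column
r3 = combine(gd, cols .=> mean)            # let DataFrames.jl pick the value columns
```

With only two value columns the explicit list is the shortest; the other two forms pay off as the number of columns grows.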

These are the basic patterns. The other issue is how to name your target columns. By default they get a name of the form `"source_function"`, like this:

``````julia> combine(gd, [:num, :last] .=> mean)
4×3 DataFrame
 Row │ class   num_mean  last_mean
     │ String  Float64   Float64
─────┼─────────────────────────────
   1 │ A           20.0     5.0
   2 │ B           40.0     9.0
   3 │ C           20.0    11.0
   4 │ D           15.0    41.6667
``````

You can keep the original column names like this (this is sometimes preferred):

``````julia> combine(gd, [:num, :last] .=> mean, renamecols=false)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667
``````

or like this:

``````julia> combine(gd, [:num, :last] .=> mean .=> identity)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667
``````

The last example shows that the final element can be any function that takes the source column name as a string and returns the target column name, so you can do:

``````julia> combine(gd, [:num, :last] .=> mean .=> col -> "prefix_" * uppercase(col) * "_suffix")
4×3 DataFrame
 Row │ class   prefix_NUM_suffix  prefix_LAST_suffix
     │ String  Float64            Float64
─────┼───────────────────────────────────────────────
   1 │ A                    20.0              5.0
   2 │ B                    40.0             9.0
   3 │ C                    20.0             11.0
   4 │ D                    15.0             41.6667
``````
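If you already know the target names, you can also pass them directly instead of a function. The sketch below assumes the `d` from the question; the names `avg_num`, `avg_last` and the `mean_of_` prefix are made up for illustration.

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
gd = groupby(d, :class)

# Explicit target names: one name per source column, paired up by broadcasting
res_names = combine(gd, [:num, :last] .=> mean .=> [:avg_num, :avg_last])

# Function-generated target names: the function receives the source name as a string
res_fun = combine(gd, [:num, :last] .=> mean .=> (col -> "mean_of_" * col))
```

Explicit names are easier to read for a couple of columns, while a naming function scales better when the column list is long or computed.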

### Edit

To do the operation in a single line, you can just write:

``````combine(groupby(d, :class), [:num, :last] .=> mean)
``````

The benefit of storing `groupby(d, :class)` in a variable is that the grouping is performed once and the resulting object can then be reused many times, which speeds things up.
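To illustrate the reuse, here is a small sketch that groups once and then runs several independent aggregations on the same grouped object. It assumes the `d` from the question; the result names `means`, `counts` and `maxima` are only illustrative.

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])

gd = groupby(d, :class)                        # grouping is computed once here

means  = combine(gd, [:num, :last] .=> mean)   # per-group means
counts = combine(gd, nrow)                     # number of rows in each group
maxima = combine(gd, :num => maximum)          # per-group maximum of :num
```

Each `combine` call works on the already-built `gd`, so none of them has to redo the grouping.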

Also, if you use DataFramesMeta.jl, you could write e.g.:

``````@chain d begin
    groupby(:class)
    combine([:num, :last] .=> mean)
end
``````

which is more typing, but this is a style that people coming from R tend to like.