I have this dataframe:
d=DataFrame(class=["A","A","A","B","C","D","D","D"],
num=[10,20,30,40,20,20,13,12],
last=[3,5,7,9,11,13,100,12])
and I want to do a groupby. In Python I would do:
d.groupby('class')[['num','last']].mean()
How can I do the same in Julia?
I am trying to use combine and groupby, but have had no success so far.
Update: I managed to do it this way:
gd = groupby(d, :class)
combine(gd, :num => mean, :last => mean)
Is there any better way to do it?
It depends on what you mean by "a better way". You can apply the same function to multiple columns like this:
combine(gd, [:num, :last] .=> mean)
or, if you had a lot of columns and, e.g., wanted to apply mean to all columns except the grouping column, you could do:
combine(gd, Not(:class) .=> mean)
or (if you want to avoid having to remember which column was grouping)
combine(gd, valuecols(gd) .=> mean)
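To see why these forms are interchangeable, it may help to look at what the broadcasted `.=>` produces before `combine` receives it. The following is a minimal sketch, assuming DataFrames.jl 1.x and the Statistics standard library; the variable names `pairs_vec`, `vcols`, and `res_*` are only illustrative.

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
gd = groupby(d, :class)

# Broadcasting => over a vector of columns yields one Pair per column,
# i.e. the same thing as writing :num => mean, :last => mean by hand.
pairs_vec = [:num, :last] .=> mean

# valuecols(gd) returns the non-grouping columns as a vector of Symbols.
vcols = valuecols(gd)

res_explicit = combine(gd, :num => mean, :last => mean)
res_vector   = combine(gd, pairs_vec)
res_value    = combine(gd, vcols .=> mean)
```

All three results should be the same data frame, so the choice between them is mostly about how many columns you have and whether you want to list them by name.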
These are the basic patterns. The other issue is how to name your target columns. By default they get a name of the form "source_function", like this:
julia> combine(gd, [:num, :last] .=> mean)
4×3 DataFrame
 Row │ class   num_mean  last_mean
     │ String  Float64   Float64
─────┼─────────────────────────────
   1 │ A           20.0     5.0
   2 │ B           40.0     9.0
   3 │ C           20.0    11.0
   4 │ D           15.0    41.6667
You can keep the original column names like this (this is sometimes preferred):
julia> combine(gd, [:num, :last] .=> mean, renamecols=false)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667
or like this:
julia> combine(gd, [:num, :last] .=> mean .=> identity)
4×3 DataFrame
 Row │ class   num      last
     │ String  Float64  Float64
─────┼──────────────────────────
   1 │ A          20.0   5.0
   2 │ B          40.0   9.0
   3 │ C          20.0  11.0
   4 │ D          15.0  41.6667
The last example shows that the last part can be any function that takes the source column name as a string and generates your target column name, so you can do:
julia> combine(gd, [:num, :last] .=> mean .=> col -> "prefix_" * uppercase(col) * "_suffix")
4×3 DataFrame
 Row │ class   prefix_NUM_suffix  prefix_LAST_suffix
     │ String  Float64            Float64
─────┼───────────────────────────────────────────────
   1 │ A                    20.0              5.0
   2 │ B                    40.0              9.0
   3 │ C                    20.0             11.0
   4 │ D                    15.0             41.6667
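If you only need fixed names rather than a naming function, the last part of the pair can also be the target name itself. A minimal sketch, assuming DataFrames.jl 1.x; the names `avg_num` and `avg_last` are made up for illustration:

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])
gd = groupby(d, :class)

# source => function => target name, one pair per column
res_named = combine(gd, :num => mean => :avg_num, :last => mean => :avg_last)

# the same thing, broadcasting over a vector of target names
res_bcast = combine(gd, [:num, :last] .=> mean .=> [:avg_num, :avg_last])
```

Both calls should produce the columns class, avg_num, and avg_last.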
Doing the operation in a single line:
You can just do:
combine(groupby(d, :class), [:num, :last] .=> mean)
The benefit of storing groupby(d, :class) in a variable is that you perform the grouping once and can then reuse the resulting object many times, which speeds things up.
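As a rough illustration of that reuse, here is a sketch that runs several aggregations against one stored GroupedDataFrame. It assumes DataFrames.jl 1.x; the result names `means`, `maxima`, and `counts` are only illustrative.

```julia
using DataFrames, Statistics

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])

gd = groupby(d, :class)   # the grouping work happens once, here

# each combine call below reuses the same grouped object
means  = combine(gd, [:num, :last] .=> mean)
maxima = combine(gd, [:num, :last] .=> maximum)
counts = combine(gd, nrow)   # number of rows per group, in a column named nrow
```

With a data frame this small the difference is negligible; the saving matters when the grouping itself is expensive.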
Also, if you use DataFramesMeta.jl, you could write, e.g.:
@chain d begin
    groupby(:class)
    combine([:num, :last] .=> mean)
end
which is more typing, but this is a style that people coming from R tend to like.