I'm wondering if there is an efficient way to do the following in Julia:
I have a DataFrame of the following form:
julia> df1 = DataFrame(var1=["a","a","a","b","b","b","c","c","c"],
var2=["p","q","r","p","p","r","q","p","p"],
var3=[1,2,3,2,5,4,6,7,8])
9×3 DataFrame
│ Row │ var1 │ var2 │ var3 │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ a │ p │ 1 │
│ 2 │ a │ q │ 2 │
│ 3 │ a │ r │ 3 │
│ 4 │ b │ p │ 2 │
│ 5 │ b │ p │ 5 │
│ 6 │ b │ r │ 4 │
│ 7 │ c │ q │ 6 │
│ 8 │ c │ p │ 7 │
│ 9 │ c │ p │ 8 │
And I want to return a DataFrame that contains the same columns but only the rows where var3
has its minimum value within groups according to var1
.
I have tried using the split-apply-combine approach but couldn't seem to find a way to filter the rows while returning all columns.
Appreciate any help on this.
An alternative way to do it if you do not have duplicates in :var3
per group is:
julia> combine(sdf -> sdf[argmin(sdf.var3), :], groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
If you may have duplicates then use:
julia> combine(sdf -> filter(:var3 => ==(minimum(sdf.var3)), sdf), groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
instead.
Another example handling duplicates correctly is:
julia> combine(sdf -> first(groupby(sdf, :var3, sort=true)), groupby(df1, :var1))
3×3 DataFrame
Row │ var1 var2 var3
│ String String Int64
─────┼───────────────────────
1 │ a p 1
2 │ b p 2
3 │ c q 6
it is not very efficient in this case but shows you how you can work with groupby
in DataFrames.jl.