Combine grouped DF in Julia with Floats and Strings


I have a grouped DataFrame gdf that I want to combine. For each group, I want the mean of var1, which is a Float, and the first element of var2, which is a String.

I tried

combine(gdf, :var1 .=> mean, :var2 .=> first(:var2))

But I get the error ERROR: MethodError: no method matching iterate(::Symbol). I also tried first(:var2, 1).

Thanks for any help.


Solution

  • This is the way to do it with DataFrames.jl:

    julia> using DataFrames
    
    julia> using Statistics
    
    julia> df = DataFrame(id=[1,2,1,2,1,2], var1=1.5:1:6.5, var2=string.(1:6))
    6×3 DataFrame
     Row │ id     var1     var2
         │ Int64  Float64  String
    ─────┼────────────────────────
       1 │     1      1.5  1
       2 │     2      2.5  2
       3 │     1      3.5  3
       4 │     2      4.5  4
       5 │     1      5.5  5
       6 │     2      6.5  6
    
    julia> gdf = groupby(df, :id)
    GroupedDataFrame with 2 groups based on key: id
    First Group (3 rows): id = 1
     Row │ id     var1     var2
         │ Int64  Float64  String
    ─────┼────────────────────────
       1 │     1      1.5  1
       2 │     1      3.5  3
       3 │     1      5.5  5
    ⋮
    Last Group (3 rows): id = 2
     Row │ id     var1     var2
         │ Int64  Float64  String
    ─────┼────────────────────────
       1 │     2      2.5  2
       2 │     2      4.5  4
       3 │     2      6.5  6
    
    julia> combine(gdf, :var1 => mean, :var2 => first)
    2×3 DataFrame
     Row │ id     var1_mean  var2_first
         │ Int64  Float64    String
    ─────┼──────────────────────────────
       1 │     1        3.5  1
       2 │     2        4.5  2
    

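    (There is no need for the . before =>, and first should be passed as a function rather than called with an argument: first(:var2) is evaluated before combine runs and tries to iterate over the Symbol, which raises the MethodError.)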
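
    As a minimal sketch of these points, reusing the example data from this answer: the snippet below reproduces the error from the question in isolation, then uses the source => function => target form to choose the output column names, and finally shows a case where the broadcasted .=> is actually useful (one function applied to several columns). The output names var1_avg and var2_head are made up for illustration.

    ```julia
    using DataFrames, Statistics

    df = DataFrame(id=[1,2,1,2,1,2], var1=1.5:1:6.5, var2=string.(1:6))
    gdf = groupby(df, :id)

    # first(:var2) is called on the Symbol itself, before combine ever sees it.
    # A Symbol is not iterable, so this throws a MethodError.
    err = try
        first(:var2)
    catch e
        e
    end

    # Pass the function itself; combine applies it to the column of each group.
    # The third element of the pair sets the output column name (illustrative names).
    renamed = combine(gdf, :var1 => mean => :var1_avg, :var2 => first => :var2_head)

    # Broadcasting with .=> helps when one function is applied to several columns:
    # this expands to [:var1 => first, :var2 => first].
    firsts = combine(gdf, [:var1, :var2] .=> first)
    ```

    If the output names are omitted, DataFrames.jl generates them from the column and function names, as in var1_mean and var2_first above.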

    If you prefer assignment style (instead of the functional style with => pairs), use DataFramesMeta.jl:

    julia> using DataFramesMeta
    
    julia> @combine(gdf, :var1_mean=mean(:var1), :var2_first=first(:var2))
    2×3 DataFrame
     Row │ id     var1_mean  var2_first
         │ Int64  Float64    String
    ─────┼──────────────────────────────
       1 │     1        3.5  1
       2 │     2        4.5  2
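
    If the grouped data frame is not needed for anything else, grouping and combining can probably be done in one step. The sketch below assumes the @by macro from DataFramesMeta.jl and the same example df as above; it is an alternative spelling of the @combine call, not something the question requires.

    ```julia
    using DataFrames, DataFramesMeta, Statistics

    df = DataFrame(id=[1,2,1,2,1,2], var1=1.5:1:6.5, var2=string.(1:6))

    # @by groups df by :id and applies the assignments to each group in one call
    res = @by(df, :id, :var1_mean = mean(:var1), :var2_first = first(:var2))
    ```

    The plain DataFrames.jl equivalent is combine(groupby(df, :id), :var1 => mean, :var2 => first).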