rdplyrgroup-byweighted-average

Collapse data frame, by group, using lists of variables for weighted average AND sum


I want to collapse the following data frame, using both summation and weighted averages, according to groups.

I have the following data frame

group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)

df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)

I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).

I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).

to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]

to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed) 

And my goal is to simultaneously collapse my data using eiter sum or weighted average, according to the type of variable (ie if its in percentage terms, I use weighted average).

Here is my best attempt:

 df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))

But, as you can see, it is not a weighted average

I have tried many different ways of using the weighted.mean fucntion, but have had no luck. Here is an example of one such attempt;

df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))

And the corresponding error:

Error in weighted.mean.default(to_be_weighted_2, weighting) : 
'x' and 'w' must have the same length

Solution

  • Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.

    The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.

    library(tidyverse)
    
    set.seed(1234)
    group_id = c(1,1,1,2,2,3,3,3,3,3)
    var_1 = sample.int(20, 10)
    var_2 = sample.int(20, 10)
    var_percent_1 =rnorm(10,.5,.4)
    var_percent_2 =rnorm(10,.5,.4)
    weighting =sample.int(50, 10)
    
    df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
    
    df_to_collapse %>%
        gather(key = var, value = value, -group_id, -weighting) %>%
        mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
        group_by(group_id, var) %>%
        summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
        ungroup() %>%
        spread(key = var, value = sum_or_avg)
    #> # A tibble: 3 x 5
    #>   group_id var_1 var_2 var_percent_1 var_percent_2
    #>      <dbl> <dbl> <dbl>         <dbl>         <dbl>
    #> 1        1    26    31         0.269         0.483
    #> 2        2    32    21         0.854         0.261
    #> 3        3    29    49         0.461         0.262
    

    Created on 2018-05-04 by the reprex package (v0.2.0).