rfunctionsurveymutateacross

Calculate svymean for variable within user-defined function called by mutate(), across(), with variable name as function parameter


My goal is to take a dataframe that contains individual level data of many variables and apply a user defined function to return those variables' survey-adjusted weighted means and create new columns called, wt_mean_VAR, where each entry is the svymean for that VAR.

In my attempt to do so, I have defined a function, calc_wt_mean_var_fn_df_ex, that takes a dataframe (data) and variable name (VAR) as parameters. The function is then supposed to create the svydesign object (svydes) from data, and then calculate the svymean of VAR. The function should return the svymean of this VAR.

I am then try to combine this user-defined function with a vector of variable names, i.e., dplyr::mutate:across:any_of(regularVar_names_ex), to create columns called wt_mean_VAR, where every entry for each particular column is the svymean calculated in my user defined-function for that VAR.

Example data:

input.ds.2018 = tibble(
  Var1 = c(1, 1, NA, NA, 1, 0),
  Var2 = rep(c(1, 0), 3),
  V3 = c(NA, rep(2, 4), 1),
  y_4 = c(NA, "y", "z", "l", "m", "n"),
  X_AGE80 = c(17, 18, NA, 84, 21, 72),
  WT = c(2,3,8,4,0.1,5),
  X_PSU = c(1,2,3,4,5,6),
  X_STSTR = c(1,2,1,2,1,2)
)

regularVar_names_ex = c("Var1","Var2") 

calc_wt_mean_var_fn_df_ex = function(data,VAR){
  #only proceed if there are at least 2 unique answers to the variable of interest, VAR
  if(data %>% filter(!is.na(eval(parse(text = VAR)))) %>% group_by(eval(parse(text = VAR))) %>% count %>%  nrow > 1){
    #create survey design object using dataframe, data
    svydes = svydesign(id =~X_PSU,strata =~ X_STSTR, weights =~WT, data =data %>% filter(!is.na(WT) & !is.na(X_PSU)),nest = TRUE )
    #calculate survey-adjusted mean and SE and return results as tibble
    res = svymean((~eval(parse(text = VAR))), svydes, na.rm = TRUE) %>%  as_tibble()
    #extract the answer, the survey-adjusted mean weight stored in res, as ans
    ans = res %>% select(mean) %>% pull 
  } else{
    ans = NA
  }
  return(ans)
}

The survey-adjusted weighted mean of Var1 in input.ds.2018 = 0.505 and weighted mean of Var2 = 0.457.

When I do:

output.ds.2018 = input.ds.2018 %>%
  mutate(across(all_of(regularVar_names_ex), 
                \(x) calc_wt_mean_var_fn_df_ex(pick(x,X_PSU,X_STSTR,WT), VAR = x), 
                .names = "wt_mean_{.col}"))

I would like to get the following:

output.desired = tibble(
  Var1 = c(1, 1, NA, NA, 1, 0),
  Var2 = rep(c(1, 0), 3),
  V3 = c(NA, rep(2, 4), 1),
  y_4 = c(NA, "y", "z", "l", "m", "n"),
  X_AGE80 = c(17, 18, NA, 84, 21, 72),
  WT = c(2,3,8,4,0.1,5),
  X_PSU = c(1,2,3,4,5,6),
  X_STSTR = c(1,2,1,2,1,2),
  wt_mean_Var1 = c(0.505,0.505,0.505,0.505,0.505,0.505),
  wt_mean_Var2 = c(0.457,0.457,0.457,0.457,0.457,0.457)
)
> output.desired
# A tibble: 6 x 10
   Var1  Var2    V3 y_4   X_AGE80    WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
  <dbl> <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>        <dbl>        <dbl>
1     1     1    NA NA         17   2       1       1        0.505        0.457
2     1     0     2 y          18   3       2       2        0.505        0.457
3    NA     1     2 z          NA   8       3       1        0.505        0.457
4    NA     0     2 l          84   4       4       2        0.505        0.457
5     1     1     2 m          21   0.1     5       1        0.505        0.457
6     0     0     1 n          72   5       6       2        0.505        0.457

but instead I get NAs for the weighted means. What am I doing wrong?

> output.ds.2018
# A tibble: 6 x 10
   Var1  Var2    V3 y_4   X_AGE80    WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
  <dbl> <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl> <lgl>        <lgl>       
1     1     1    NA NA         17   2       1       1 NA           NA          
2     1     0     2 y          18   3       2       2 NA           NA          
3    NA     1     2 z          NA   8       3       1 NA           NA          
4    NA     0     2 l          84   4       4       2 NA           NA          
5     1     1     2 m          21   0.1     5       1 NA           NA          
6     0     0     1 n          72   5       6       2 NA           NA

Solution

  • I think there were a couple of things going on here, but the biggest one is that when your function expected VAR to be a variable name, but when you pass the variable in via mutate(), it is a vector of values. So, the function wasn't getting what it expected for VAR. I rewrote the function to take vectors as arguments - things that will get passed in via mutate(). The function now takes the names of the variables for VAR, PSU, STRATA and WT. Regardless of those names in the input dataset, these variables take on those names (PSU, STRATA, WT and VAR) inside the function. That makes it a bit easier to deal with and removes the need for eval(parse(text = ...)), which is often frowned upon around here.

    calc_wt_mean_var_fn_df_ex = function(VAR, PSU, STRATA, WT){
      #only proceed if there are at least 2 unique answers to the variable of interest, VAR
      data <- data.frame(VAR, PSU, STRATA,WT, one=1) 
      data <- subset(data, !is.na(PSU) & !is.na(WT))
      nr <- nrow(na.omit(aggregate(data$one, list(data$VAR), sum)))
       
      if(nr > 1){
        #create survey design object using dataframe, data
        svydes = svydesign(id =~PSU,
                           strata =~STRATA, 
                           weights =~WT, 
                           data =data,nest = TRUE )
        #calculate survey-adjusted mean and SE and return results as tibble
        res = svymean(~VAR, svydes, na.rm = TRUE) 
        #extract the answer, the survey-adjusted mean weight stored in res, as ans
        ans = res[[1]]
      } else{
        ans = NA
      }
      return(ans)
    }
    

    Here's the function at work:

    library(survey)
    library(dplyr)
    input.ds.2018 = tibble(
      Var1 = c(1, 1, NA, NA, 1, 0),
      Var2 = rep(c(1, 0), 3),
      V3 = c(NA, rep(2, 4), 1),
      y_4 = c(NA, "y", "z", "l", "m", "n"),
      X_AGE80 = c(17, 18, NA, 84, 21, 72),
      WT = c(2,3,8,4,0.1,5),
      X_PSU = c(1,2,3,4,5,6),
      X_STSTR = c(1,2,1,2,1,2)
    )
    
    regularVar_names_ex = c("Var1","Var2") 
    
    input.ds.2018 %>%
      mutate(across(all_of(regularVar_names_ex), 
                    \(x) calc_wt_mean_var_fn_df_ex(x, X_PSU, X_STSTR, WT), 
                    .names = "wt_mean_{.col}"))
    #> # A tibble: 6 × 10
    #>    Var1  Var2    V3 y_4   X_AGE80    WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
    #>   <dbl> <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>   <dbl>        <dbl>        <dbl>
    #> 1     1     1    NA <NA>       17   2       1       1        0.505        0.457
    #> 2     1     0     2 y          18   3       2       2        0.505        0.457
    #> 3    NA     1     2 z          NA   8       3       1        0.505        0.457
    #> 4    NA     0     2 l          84   4       4       2        0.505        0.457
    #> 5     1     1     2 m          21   0.1     5       1        0.505        0.457
    #> 6     0     0     1 n          72   5       6       2        0.505        0.457
    

    Created on 2025-01-29 with reprex v2.1.1