My goal is to take a dataframe that contains individual level data of many variables and apply a user defined function to return those variables' survey-adjusted weighted means and create new columns called, wt_mean_VAR
, where each entry is the svymean
for that VAR
.
In my attempt to do so, I have defined a function, calc_wt_mean_var_fn_df_ex
, that takes a dataframe (data
) and variable name (VAR
) as parameters. The function is then supposed to create the svydesign object (svydes
) from data
, and then calculate the svymean
of VAR
. The function should return the svymean
of this VAR
.
I am then try to combine this user-defined function with a vector of variable names, i.e., dplyr::mutate:across:any_of(regularVar_names_ex)
, to create columns called wt_mean_VAR
, where every entry for each particular column is the svymean
calculated in my user defined-function for that VAR
.
Example data:
input.ds.2018 = tibble(
Var1 = c(1, 1, NA, NA, 1, 0),
Var2 = rep(c(1, 0), 3),
V3 = c(NA, rep(2, 4), 1),
y_4 = c(NA, "y", "z", "l", "m", "n"),
X_AGE80 = c(17, 18, NA, 84, 21, 72),
WT = c(2,3,8,4,0.1,5),
X_PSU = c(1,2,3,4,5,6),
X_STSTR = c(1,2,1,2,1,2)
)
regularVar_names_ex = c("Var1","Var2")
calc_wt_mean_var_fn_df_ex = function(data,VAR){
#only proceed if there are at least 2 unique answers to the variable of interest, VAR
if(data %>% filter(!is.na(eval(parse(text = VAR)))) %>% group_by(eval(parse(text = VAR))) %>% count %>% nrow > 1){
#create survey design object using dataframe, data
svydes = svydesign(id =~X_PSU,strata =~ X_STSTR, weights =~WT, data =data %>% filter(!is.na(WT) & !is.na(X_PSU)),nest = TRUE )
#calculate survey-adjusted mean and SE and return results as tibble
res = svymean((~eval(parse(text = VAR))), svydes, na.rm = TRUE) %>% as_tibble()
#extract the answer, the survey-adjusted mean weight stored in res, as ans
ans = res %>% select(mean) %>% pull
} else{
ans = NA
}
return(ans)
}
The survey-adjusted weighted mean of Var1 in input.ds.2018 = 0.505 and weighted mean of Var2 = 0.457.
When I do:
output.ds.2018 = input.ds.2018 %>%
mutate(across(all_of(regularVar_names_ex),
\(x) calc_wt_mean_var_fn_df_ex(pick(x,X_PSU,X_STSTR,WT), VAR = x),
.names = "wt_mean_{.col}"))
I would like to get the following:
output.desired = tibble(
Var1 = c(1, 1, NA, NA, 1, 0),
Var2 = rep(c(1, 0), 3),
V3 = c(NA, rep(2, 4), 1),
y_4 = c(NA, "y", "z", "l", "m", "n"),
X_AGE80 = c(17, 18, NA, 84, 21, 72),
WT = c(2,3,8,4,0.1,5),
X_PSU = c(1,2,3,4,5,6),
X_STSTR = c(1,2,1,2,1,2),
wt_mean_Var1 = c(0.505,0.505,0.505,0.505,0.505,0.505),
wt_mean_Var2 = c(0.457,0.457,0.457,0.457,0.457,0.457)
)
> output.desired
# A tibble: 6 x 10
Var1 Var2 V3 y_4 X_AGE80 WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA 17 2 1 1 0.505 0.457
2 1 0 2 y 18 3 2 2 0.505 0.457
3 NA 1 2 z NA 8 3 1 0.505 0.457
4 NA 0 2 l 84 4 4 2 0.505 0.457
5 1 1 2 m 21 0.1 5 1 0.505 0.457
6 0 0 1 n 72 5 6 2 0.505 0.457
but instead I get NAs for the weighted means. What am I doing wrong?
> output.ds.2018
# A tibble: 6 x 10
Var1 Var2 V3 y_4 X_AGE80 WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
1 1 1 NA NA 17 2 1 1 NA NA
2 1 0 2 y 18 3 2 2 NA NA
3 NA 1 2 z NA 8 3 1 NA NA
4 NA 0 2 l 84 4 4 2 NA NA
5 1 1 2 m 21 0.1 5 1 NA NA
6 0 0 1 n 72 5 6 2 NA NA
I think there were a couple of things going on here, but the biggest one is that when your function expected VAR
to be a variable name, but when you pass the variable in via mutate()
, it is a vector of values. So, the function wasn't getting what it expected for VAR
. I rewrote the function to take vectors as arguments - things that will get passed in via mutate()
. The function now takes the names of the variables for VAR
, PSU
, STRATA
and WT
. Regardless of those names in the input dataset, these variables take on those names (PSU
, STRATA
, WT
and VAR
) inside the function. That makes it a bit easier to deal with and removes the need for eval(parse(text = ...))
, which is often frowned upon around here.
calc_wt_mean_var_fn_df_ex = function(VAR, PSU, STRATA, WT){
#only proceed if there are at least 2 unique answers to the variable of interest, VAR
data <- data.frame(VAR, PSU, STRATA,WT, one=1)
data <- subset(data, !is.na(PSU) & !is.na(WT))
nr <- nrow(na.omit(aggregate(data$one, list(data$VAR), sum)))
if(nr > 1){
#create survey design object using dataframe, data
svydes = svydesign(id =~PSU,
strata =~STRATA,
weights =~WT,
data =data,nest = TRUE )
#calculate survey-adjusted mean and SE and return results as tibble
res = svymean(~VAR, svydes, na.rm = TRUE)
#extract the answer, the survey-adjusted mean weight stored in res, as ans
ans = res[[1]]
} else{
ans = NA
}
return(ans)
}
Here's the function at work:
library(survey)
library(dplyr)
input.ds.2018 = tibble(
Var1 = c(1, 1, NA, NA, 1, 0),
Var2 = rep(c(1, 0), 3),
V3 = c(NA, rep(2, 4), 1),
y_4 = c(NA, "y", "z", "l", "m", "n"),
X_AGE80 = c(17, 18, NA, 84, 21, 72),
WT = c(2,3,8,4,0.1,5),
X_PSU = c(1,2,3,4,5,6),
X_STSTR = c(1,2,1,2,1,2)
)
regularVar_names_ex = c("Var1","Var2")
input.ds.2018 %>%
mutate(across(all_of(regularVar_names_ex),
\(x) calc_wt_mean_var_fn_df_ex(x, X_PSU, X_STSTR, WT),
.names = "wt_mean_{.col}"))
#> # A tibble: 6 × 10
#> Var1 Var2 V3 y_4 X_AGE80 WT X_PSU X_STSTR wt_mean_Var1 wt_mean_Var2
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 NA <NA> 17 2 1 1 0.505 0.457
#> 2 1 0 2 y 18 3 2 2 0.505 0.457
#> 3 NA 1 2 z NA 8 3 1 0.505 0.457
#> 4 NA 0 2 l 84 4 4 2 0.505 0.457
#> 5 1 1 2 m 21 0.1 5 1 0.505 0.457
#> 6 0 0 1 n 72 5 6 2 0.505 0.457
Created on 2025-01-29 with reprex v2.1.1