I have a fictional weighted survey dataset that shows how responses to the question "I enjoy driving fast" vary by respondents' car colors. Here's a sample of the original dataset:
Car_Color Weight Enjoy_Driving_Fast
White 0.0002849 Slightly Disagree
Red 0.0010247 Slightly Disagree
Black 0.0046459 Strongly Agree
Red 0.0048461 Strongly Agree
Red 0.0060173 Strongly Agree
Black 0.0062723 Agree
Red 0.0083730 Strongly Agree
Black 0.0115573 Strongly Agree
Black 0.0131331 Strongly Agree
White 0.0156400 Strongly Agree
White 0.0201834 Slightly Agree
White 0.0209492 Strongly Disagree
And here's a copy of my code that imports this dataset, then converts it to a survey design object:
library(tidyverse)
library(survey)
library(srvyr)
library(fastDummies)
df_car_survey <- read_csv(
  'https://raw.githubusercontent.com/kburchfiel/car_survey_data/refs/heads/main/car_survey.csv')
car_survey_des <- df_car_survey %>% as_survey_design(
  weights = 'Weight')
I am working on a post-hoc chi-squared test that will determine whether the proportion of red car owners who agreed with this statement differs from the corresponding proportion of white car owners. Because my data is stored within a survey design object, this test will be conducted using the survey library's svychisq() function.
I tried to run this chi squared test using the following code (which is based on Chapter 6 of Exploring Complex Survey Data Analysis Using R):
chi2_car_color_agreement_red_white_agree <- car_survey_des %>%
  filter(Car_Color %in% c("Red", "White")) %>%
  drop_na(Car_Color) %>%
  svychisq(
    formula = ~ Car_Color + (Enjoy_Driving_Fast == "Agree"),
    design = .,
    statistic = "Chisq",
    na.rm = TRUE
  )
However, I received the following error:
Error in `[.data.frame`(design$variables, , as.character(cols)) :
undefined columns selected
I think the issue here is with the (Enjoy_Driving_Fast == "Agree") component of the formula. Is there a way to modify that component in order to make it compatible with R's formula logic?
I was able to get around this issue by creating a dummy variable that indicates whether or not the respondent chose 'Agree' as their response to the "I enjoy driving fast" question, then passing that variable to the formula in place of (Enjoy_Driving_Fast == "Agree"). Nevertheless, I would like to find a way to get the original formula to work so that I can skip the dummy variable creation step.
Looking at the source code of the two functions, it appears that svychisq is unable to handle (Enjoy_Driving_Fast == "Agree").
Using the anes_2020 dataset as described in the book to avoid confounding issues, we can see what happens when we step through their code:
renv::install("tidy-survey-r/srvyrexploR")
library(dplyr)
library(tidyr)
library(survey)
library(srvyr)
library(srvyrexploR)
data("anes_2020")
targetpop <- 231592693
anes_adjwgt <- anes_2020 %>%
  mutate(Weight = Weight / sum(Weight) * targetpop)
anes_des <- anes_adjwgt %>%
  as_survey_design(
    weights = Weight,
    strata = Stratum,
    ids = VarUnit,
    nest = TRUE
  )
# this works as expected
anes_des %>%
  svychisq(
    formula = ~ TrustGovernment + TrustPeople,
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )
We can break things in the same way:
anes_des %>%
  svychisq(
    formula = ~ TrustGovernment + (TrustPeople == "Some of the time"),
    design = .,
    statistic = "Wald",
    na.rm = TRUE
  )
Error in `[.data.frame`(design$variables, , as.character(cols)) :
undefined columns selected
Now onto the source code.
svychisq.survey.design<-function(formula, design,
    statistic=c("F","Chisq","Wald","adjWald","lincom","saddlepoint","wls-score"),
    na.rm=TRUE,...){
  # yadda
  cols<-formula[[2]][[3]]
  # yadda
  colvar<-unique(design$variables[,as.character(cols)])
  # yadda
}
When formula = ~ TrustGovernment + TrustPeople:
form = as.formula(~ TrustGovernment + TrustPeople, env = anes_des)
colvar <- as.character(form[[2]][[3]])
colvar
[1] "TrustPeople"
anes_des$variables[1:5 , colvar]
# A tibble: 5 × 1
TrustPeople
<fct>
1 About half the time
2 Some of the time
3 Some of the time
4 Most of the time
5 Some of the time
svychisq simply converts the two terms of the formula (which are the second and third elements of the call, since + is the first) to character strings and uses them to select from design$variables with base R column subsetting, df[ , "col"]. Returning to the formula that breaks the function:
form = as.formula(~ TrustGovernment + (TrustPeople=="Some of the time"), env = anes_des)
colvar <- as.character(form[[2]][[3]])
colvar
[1] "(" "TrustPeople == \"Some of the time\""
We actually get a length-2 character vector, and neither element is a column in design$variables.
anes_des$variables[1:5 , colvar]
Error in `anes_des$variables[1:5, colvar]`:
! Can't subset columns that don't exist.
✖ Columns `(` and `TrustPeople == "Some of the time"` don't exist.
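The same failure can be reproduced without the survey package at all. The following is a minimal base R sketch on a made-up data frame (df, a, and b are hypothetical names, not part of the question's data) showing that a parenthesised term is a call to ( whose character form is not a column name:

```r
# Toy data frame; the names are illustrative only
df <- data.frame(a = 1:3, b = 4:6)

# A formula whose second term is an expression rather than a bare name
f <- ~ a + (b == 4)

# The same extraction that svychisq performs on its formula
cols <- f[[2]][[3]]
is.call(cols)                    # the term is a call, not a symbol

# as.character() on a call yields one string per element of the call
col_names <- as.character(cols)
col_names

# Base R column subsetting by those strings fails; capture the message
res <- tryCatch(df[, col_names], error = function(e) conditionMessage(e))
res
```

With a bare name such as ~ a + b, the same extraction yields the single string "b", which is why the plain two-variable formula works.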
So why does it work in svyttest? Because the author implemented the function significantly differently. Note the use of eval(bquote( in svyttest.R:
eval(bquote(# blah)
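As a rough illustration of why that style tolerates expressions (this is not svyttest's actual code, just a base R sketch with made-up names), bquote() can splice an unevaluated term into a formula, and model.frame() then evaluates the term against the data instead of looking it up as a column name:

```r
# Toy data; the names are illustrative only
df <- data.frame(y = c(1.2, 3.4, 2.2, 4.1), g = c("a", "b", "a", "b"))

# An unevaluated expression standing in for a formula term
term <- quote(g == "a")

# .() inside bquote() splices the expression into the formula
f <- eval(bquote(y ~ .(term)))

# model.frame() evaluates each term in the data, so the expression works
mf <- model.frame(f, data = df)
names(mf)
mf[[2]]
```

The difference from svychisq is that nothing here converts the term to a string and indexes the data frame by it.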
R's non-standard evaluation is complicated enough to write a book on, so details are way outside the scope of this question.
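Since svychisq only looks up column names, one practical option is to create the logical column on the design immediately before the call. Below is a minimal sketch on a simulated design; the data, the weights, and the Agree column name are all made up for illustration, and with srvyr a mutate() step on the design object should serve the same purpose as update():

```r
library(survey)

set.seed(1)
# Simulated stand-in for the car survey; these values are not real data
toy <- data.frame(
  Car_Color = sample(c("Red", "White"), 200, replace = TRUE),
  Enjoy_Driving_Fast = sample(c("Agree", "Disagree", "Strongly Agree"),
                              200, replace = TRUE),
  Weight = runif(200, 0.5, 2)
)

toy_des <- svydesign(ids = ~1, weights = ~Weight, data = toy)

# Add the comparison as a real column so svychisq can find it by name
toy_des <- update(toy_des, Agree = Enjoy_Driving_Fast == "Agree")

res <- svychisq(~ Car_Color + Agree, design = toy_des, statistic = "Chisq")
res
```

This still creates a variable, but it keeps the comparison next to the test rather than in a separate data-preparation step.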