r bonferroni

How can I run chi-square tests more efficiently for over 10 outcomes and 5 variables?


data <- data.frame(
  sex = factor(c("M", "F", "M")),
  ageid = factor(c(8, 6, 7)),
  married = factor(c(2, 1, 2)),
  cagv_typ = factor(c("non-primary", "primary", "non-primary")),
  sq5_1 = factor(c(1, 1, 1)),
  sq5_2 = factor(c(0, 1, 0))
)

In this data frame, sex and married are the subgroup variables, and the rest are outcomes. My real data has more than 10 outcome variables and 5 subgroup variables.

At first, I wrote the following code:

chisq_test <- function(data, var1, var2) {
  contingency_table <- table(data[[var1]], data[[var2]])
  test_result <- chisq.test(contingency_table)
  return(test_result)
}

chisq_test(data = data, var1 = "sex", var2 = "cagv_typ")

However, this is still very time-consuming if I have to enter each pair of subgroup variable and outcome manually. Is there a better approach that runs all the chi-square tests in less time?

Thank you in advance.

Best wishes


Solution

  • You can use expand.grid to get all the combinations you are looking for:

    combos <- expand.grid(x = names(data)[c(1, 3)], y = names(data)[-c(1, 3)])
    
    combos
    #>         x        y
    #> 1     sex    ageid
    #> 2 married    ageid
    #> 3     sex cagv_typ
    #> 4 married cagv_typ
    #> 5     sex    sq5_1
    #> 6 married    sq5_1
    #> 7     sex    sq5_2
    #> 8 married    sq5_2
    

    We can then use apply to iterate over the rows of this data frame and call your chisq_test function on each combination of variables. Extracting the p-value from each of the 8 chi-square tests gives one value per row:

    combos$pval <- apply(combos, 1, function(x) chisq_test(data, x[1], x[2])$p.value)
    
    combos
    #>         x        y      pval
    #> 1     sex    ageid 0.2231302
    #> 2 married    ageid 0.2231302
    #> 3     sex cagv_typ 0.6650055
    #> 4 married cagv_typ 0.6650055
    #> 5     sex    sq5_1 0.5637029
    #> 6 married    sq5_1 0.5637029
    #> 7     sex    sq5_2 0.6650055
    #> 8 married    sq5_2 0.6650055
    

    This will easily scale up to five x variables and 10 y variables using the same code.
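
    As a rough sketch of that scaling, the column positions can be replaced by name vectors, which is easier to maintain once there are 5 subgroup variables and 10 or more outcomes. The names subgroup_vars and outcome_vars are hypothetical, and mapply is swapped in for apply here only as one possible alternative, not as the method used above:

    ```r
    # Toy data copied from the question; real data would have more columns.
    data <- data.frame(
      sex = factor(c("M", "F", "M")),
      ageid = factor(c(8, 6, 7)),
      married = factor(c(2, 1, 2)),
      cagv_typ = factor(c("non-primary", "primary", "non-primary")),
      sq5_1 = factor(c(1, 1, 1)),
      sq5_2 = factor(c(0, 1, 0))
    )

    # Hypothetical name vectors: list the subgroup columns explicitly and
    # treat every other column as an outcome.
    subgroup_vars <- c("sex", "married")
    outcome_vars <- setdiff(names(data), subgroup_vars)

    # One row per (subgroup, outcome) pair, kept as character strings.
    combos <- expand.grid(x = subgroup_vars, y = outcome_vars,
                          stringsAsFactors = FALSE)

    # Run one chi-square test per row and keep only the p-value.
    # With only 3 observations, chisq.test() will warn that the
    # approximation may be incorrect; that is expected for this toy data.
    combos$pval <- mapply(
      function(x, y) chisq.test(table(data[[x]], data[[y]]))$p.value,
      combos$x, combos$y,
      USE.NAMES = FALSE
    )

    combos
    ```

    Adding a subgroup variable then only means adding its name to subgroup_vars; the rest of the code stays the same.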

    Please remember that if you carry out 50 chi-square tests, the unadjusted p-values cannot be taken at face value because of multiple hypothesis testing. At a 5% significance level you would expect 50 × 0.05 = 2.5, that is 2 or 3, "significant" results purely by chance even if no real associations exist, so you will need a Bonferroni correction or similar.

    Created on 2023-09-12 with reprex v2.0.2
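
    One way to apply the correction mentioned above is base R's p.adjust. This is a minimal sketch, assuming the p-values are the eight shown in the output earlier (copied here by hand so the snippet runs on its own); the variable names pval, p_bonf and p_holm are illustrative:

    ```r
    # p-values copied from the combos output shown earlier
    pval <- c(0.2231302, 0.2231302, 0.6650055, 0.6650055,
              0.5637029, 0.5637029, 0.6650055, 0.6650055)

    # Bonferroni: multiply each p-value by the number of tests, capped at 1
    p_bonf <- p.adjust(pval, method = "bonferroni")

    # Holm: controls the same family-wise error rate and is never more
    # conservative than Bonferroni
    p_holm <- p.adjust(pval, method = "holm")

    data.frame(pval, p_bonf, p_holm)
    ```

    With this toy data every adjusted value is capped at 1, since even the smallest p-value multiplied by 8 exceeds 1. On real data, the adjusted column is the one to compare against the significance level.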