I want to do t-test(or chi^2 test) to estimate the difference of variables between grou=0
and grou=1
. All variables in the dataset are imputed by MICE. Variables include AGE
,SCORE
,GENDER
, HEART
, etc.; AGE
and SCORE
are continuous variables, and GENDER
and HEART
are categorical variables.
If the t-test is done for only one variable at a time, I know the code is:
library(MICE)
data_im<-mice(data, m=5,seed=6666)
summary(pool(with(data_im,glm(AGE~grou))))
The output p-value
is also the p-value
of the t-test.
However, there are too many variables I need to evaluate,thus I would like to write a for loop or create a function to output the summary test results of multiple variable at once.
I have tried to write:
vars <- c("AGE","SCORE","GENDER","HEART")
afterMICE <- c()
for(i in 1:4){
pool_fitMICE <- pool(with(data_im,glm(substitute(y ~ grou,list(y=as.name(vars[i]))))))
}
**Error in eval(predvars, data, env) : object 'AGE' not found**
afterMICE <- rbind(afterMICE,C(vars[i],coef(summary(pool_fitMICE))[2,c(1,2,4)]))
I know the reason for the error is that data_im
is not a regular dataframe structure.
How to modify the code to achieve batch output of summary results from different variables?
#------------------------------------------------------------------------------
Edit:
mice()
is a function to do multivariate imputation for missing data. https://www.rdocumentation.org/packages/mice/versions/3.15.0/topics/mice
Take the data("nhanes2")
for example and we can see the structure of data_im
.
library(MICE)
data("nhanes2")
vars=c("bmi","chl","age","hyp")
catvars=c("age","hyp")
data_im=mice(nhanes2,m=5,seed=6666)
pool.fits <- pool(with(data_im, glm(age~hyp)))
But we need to batch pool the results of pool(with(data_im, glm(vars~hyp)))
(Variables in vars
take turns being a dependent variable in glm()
, with hyp
as the independent variable. )
You actually don't need to mess around with eval
and substitute
in this situation. You can create the formula normally (using as.formula
for example) and when the formula is evaluated inside with
it will automatically find the variables in the imputed data.
library(mice)
set.seed(123)
vars <- c("age", "bmi", "chl")
imps <- mice(nhanes, printFlag = FALSE)
mods <- list()
for(i in seq_along(vars)) {
mods[[i]] <- pool(with(imps,
glm(as.formula(paste0(vars[[i]], "~ hyp")))))
}
names(mods) <- vars
lapply(mods, summary)
#> $age
#> term estimate std.error statistic df p.value
#> 1 (Intercept) 0.5216374 0.4873196 1.070422 17.71793 0.2987954
#> 2 hyp 1.0163241 0.3744249 2.714360 19.13894 0.0136971
#>
#> $bmi
#> term estimate std.error statistic df p.value
#> 1 (Intercept) 27.0037636 2.882558 9.36798555 16.45801 5.297937e-08
#> 2 hyp -0.1502389 2.191507 -0.06855506 18.35948 9.460849e-01
#>
#> $chl
#> term estimate std.error statistic df p.value
#> 1 (Intercept) 157.88698 24.65957 6.402665 20.91373 2.445428e-06
#> 2 hyp 29.06627 19.62684 1.480945 19.90150 1.542804e-01
Created on 2023-03-08 with reprex v2.0.2
Edit: Putting output into a data.frame.
out <- data.frame(var = vars,
coef_int = 0,
coef_hyp = 0)
for(i in seq_along(vars)) {
out[i, 2:3] <- mods[[i]]$pooled$estimate
}
out
#> var coef_int coef_hyp
#> 1 age 0.5216374 1.0163241
#> 2 bmi 27.0037636 -0.1502389
#> 3 chl 157.8869758 29.0662740