Assume we use the following example from the mice
package:
imp <- mice(nhanes, m=5, print=FALSE, seed=55152)
I want to obtain the means of nhanes$chl
grouped by nhanes$hyp
. Without using multiple imputation, one would do:
library(mice)
library(dplyr)
nhanes %>%
group_by(hyp) %>%
summarise(chl=mean(chl, na.rm=TRUE))
How can I obtain such pooled estimates of means of chl
per group of hyp
after multiple imputation with mice
?
Without imputation, only rows with complete cases for chl and hyp are used, referred to as list-wise deletion. Using this method, we get these mean
s by group:
> tapply(X=nhanes$chl, INDEX=nhanes$hyp, FUN=mean, na.rm=TRUE)
1 2
181.8182 229.0000
Multiple imputation,
> m. <- 5
> imp <- mice::mice(nhanes, m=m., print=FALSE, seed=55152)
gives us m imputed datasets, that we usually combine in a long format. (Note, that m > 20 is recommended.1)
> long <- mice::complete(imp, action='long')
> with(long, table(.imp)) ## nrows by .imp
.imp
1 2 3 4 5
25 25 25 25 25
> nrow(nhanes) ## nrows original
[1] 25
Each imputed dataset will have it's own statistics (i.e. group mean
s in this case). For example .imp == 1
will have:
> tapply(long[long$.imp == 1, ]$chl, long[long$.imp == 1, ]$hyp, mean, na.rm=TRUE)
1 2
185.8421 212.5000
According to Rubin's rules, what we want is the mean of these imputed statistics to get the estimate. Accordingly we calculate the statistics for each imputation,
> (r0 <- do.call('rbind', by(long, ~ .imp, \(x) tapply(x$chl, x$hyp, mean))))
1 2
1 185.8421 212.5000
2 186.9000 231.0000
3 187.9000 237.8000
4 189.0000 232.5000
5 188.1053 215.8333
and take the means of these results.
> colMeans(r0)
1 2
187.5495 225.9267
We might report the estimates with standard errors, which we compute accordingly:
> ## compute mean and variance for each imputation
> R1 <- by(long, ~ .imp, \(x)
+ sapply(c(mean=mean, var=\(x) var(x)/length(x)), \(f)
+ tapply(x$chl, x$hyp, f))) |>
+ simplify2array()
>
> Q <- rowMeans(R1[, 1, ]) ## calculate mean estimates
> U <- rowMeans(R1[, 2, ]) ## calculate within variances
>
> B <- rowSums(((R1[, 1, ] - Q)^2))/(m. - 1) ## calculate between variances
>
> T <- U + (1 + 1/m.)*B ## calculate total variances
>
> cbind(Estimate=Q, 'Std. Error'=sqrt(T))
Estimate Std. Error
1 187.5495 9.175414
2 225.9267 21.404731
Data:
> nhanes <- mice::nhanes