r, cluster-analysis, imputation, dimensionality-reduction

Number of observations changing significantly after imputing using mice() in R


My data has a substantial number of missing values, so I can't rely on the default complete-case approach (na.omit()) for my downstream analysis, as it removes the whole row if even one value is absent. My understanding is that the mice package is a robust way to perform multiple consecutive imputations of the data, but after I call the complete() function, my number of observations jumps from 58 to 290.

Here is my code where the jump seems to lie:

# Impute missing values using mice. mydata: 58obs of 30 variables
imputed_data <- mice(mydata, m = 5, method = "pmm", seed = 123)

# Stack the imputed datasets in "long" format - here is where my dataframe seems to change considerably in structure. pooled_data becomes 290obs of 32 variables
pooled_data <- complete(imputed_data, "long", include = FALSE)

Why is this the case, and am I still able to perform downstream dimensionality reduction and statistical testing on this data frame and have it be representative of the original dataset? This is my goal with the imputation. If there are superior methods to perform this imputation, I'm also very interested to learn. Thanks in advance!


Solution

  • Well, you used m = 5 (which is also the default), so mice produced 5 imputed datasets; stacking them in "long" format gives 5 × 58 = 290 rows. The two extra columns are .imp, the imputation number, and .id, the original row identifier. An even higher value of m is generally recommended (see my respective answer on this). The imputed datasets differ slightly from one another, which reflects the uncertainty introduced by the imputation. Do not analyse the stacked data frame as if it were a single dataset; instead, run your analysis separately on each of the m completed datasets and combine the results: your point estimate is the mean of the m estimates, and for the variance, following Rubin's rules, you must account for both the within-imputation variability and the additional between-imputation uncertainty introduced by the imputation process itself, see there.
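A minimal sketch of this workflow, using the small nhanes example dataset that ships with mice in place of your mydata (your own data and model formula would substitute for these assumptions). It shows why the "long" format has m × n rows, and the analyse-then-pool route via with() and pool(), which applies Rubin's rules:

```r
library(mice)

# nhanes: example dataset shipped with mice (25 obs, 4 variables, with NAs)
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Stacked "long" format: m * n rows, plus .imp and .id bookkeeping columns
long <- complete(imp, "long", include = FALSE)
nrow(long)          # 5 * 25 = 125, i.e. m times the original row count
table(long$.imp)    # 25 rows per imputation

# Correct route for inference: fit the model in each completed dataset,
# then pool the m fits with Rubin's rules (within + between variance)
fit <- with(imp, lm(bmi ~ age + chl))   # hypothetical model formula
pooled <- pool(fit)
summary(pooled)
```

For exploratory steps such as dimensionality reduction, one common pragmatic choice is to run them on a single completed dataset (e.g. `complete(imp, 1)`) while keeping the pooled route for any formal statistical testing.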