I have a longitudinal (panel) data frame called tradep_red in long format that contains 200 countries (country), 26 years (year), the continuous dependent variable gini and 2 continuous predictor variables (trade and unempl, in reality there are 13 but I reduced it to 2 for the sake of this question). Both gini and the predictor variables contain missing values. Dummy data is shown below:
# Generate dummy data
set.seed(12345)
country <- as.factor(rep(1:200, each = 26))
year <- rep(1:26, times = 200)
gini <- rnorm(n = 200*26, mean = 20, sd = 4)
trade <- rnorm(n = 200*26, mean = 1000, sd = 7)
unempl <- rnorm(n = 200*26, mean = 4, sd = 0.2)
# Add NA values
missing_indices_gini <- sample(1:length(gini), 1000)
gini[missing_indices_gini] <- NA
missing_indices_trade <- sample(1:length(trade), 800)
trade[missing_indices_trade] <- NA
missing_indices_unempl <- sample(1:length(unempl), 900)
unempl[missing_indices_unempl] <- NA
# Combine into dataframe
tradep_red <- data.frame(country, year, gini, trade, unempl)
head(tradep_red)
## country year gini trade unempl
## 1 1 1 22.34212 1006.3982 3.740346
## 2 1 2 22.83786 997.7583 3.801918
## 3 1 3 19.56279 996.9160 3.699202
## 4 1 4 NA NA 3.838534
## 5 1 5 22.42355 996.0563 3.835563
## 6 1 6 NA 1005.5007 4.115319
I want to multiply impute the missing values in the data while specifically accounting for the multilevel structure in the data (i.e. clustering by country). With the code below (using the mice package), I have been able to create imputed data sets with the pmm method.
library(mice)
# Multiple imputation
predictorMatrix <- quickpred(tradep_red,
include = c("country", "gini", "trade", "unempl"),
exclude = c("year"), mincor = 0.1)
imp <- mice(data = tradep_red,
m = 3,
maxit = 5,
method = "pmm",
predictorMatrix = predictorMatrix,
seed = 123)
However, I would like to use the 2l.pan method (or another method such as panImpute) to account for the cluster variable country. The 2l.pan method requires a cluster variable to be specified in the predictorMatrix by giving country a value of -2, and then running the imputation:
predictorMatrix["country", ] <- -2 # specify country as cluster variable
imp <- mice(data = tradep_red,
m = 3,
maxit = 5,
method = "2l.pan",
predictorMatrix = predictorMatrix,
seed = 123)
This however gives the error:
## iter imp variable
## 1 1 giniError in mice.impute.2l.pan(y = c(22.3421152713754, 22.8378640700381, :
## No class variable
Alternatively, the cluster variable can be specified in a formula statement with the | operator. Moreover, the formula statement is required to be a list. I have not succeeded in correctly specifying this formula statement. The code below shows what I have tried:
formula_imp <- list(gini + trade + unempl ~ (1 | country))
imp <- mice(data = tradep_red,
m = 3,
maxit = 5,
method = "2l.pan",
predictorMatrix = predictorMatrix,
formulas = formula_imp,
seed = 123)
This gives the error:
## iter imp variable
## 1 1 gini trade unempl giniError in mice.impute.2l.pan(y = c(22.3421152713754, 22.8378640700381, :
## No class variable
## In addition: Warning messages:
## 1: In Ops.factor(1, country) : ‘|’ not meaningful for factors
## 2: In Ops.factor(1, country) : ‘|’ not meaningful for factors
## 3: In Ops.factor(1, country) : ‘|’ not meaningful for factors
I get similar errors when trying to use the alternative panImpute method in the mice function. How can I correctly specify country to be the cluster variable for the multiple imputation process? Any help or references are greatly appreciated!
The class (cluster) variable needs to be an integer rather than a factor. Add the following conversion before running the imputation, and your first attempt with the predictorMatrix will work:
library(dplyr)  # provides %>% and mutate()
tradep_red <- tradep_red %>% mutate(country = as.integer(country))
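
For reference, here is a minimal end-to-end sketch of the integer conversion followed by a 2l.pan run, assuming the mice and pan packages are installed. The reduced panel size and the names dat, pred and meth are illustrative only and not part of the question. Note one assumption that differs from the question's code: as far as I understand the mice documentation, rows of the predictorMatrix are the variables being imputed and columns are the predictors, so this sketch puts the -2 in the country column rather than in the country row.

```r
library(mice)  # the "2l.pan" method additionally needs the pan package installed

# Smaller dummy panel (illustrative sizes: 20 countries x 10 years)
set.seed(12345)
n_country <- 20
n_year <- 10
n <- n_country * n_year
dat <- data.frame(
  country = as.factor(rep(1:n_country, each = n_year)),
  year    = rep(1:n_year, times = n_country),
  gini    = rnorm(n, mean = 20, sd = 4),
  trade   = rnorm(n, mean = 1000, sd = 7),
  unempl  = rnorm(n, mean = 4, sd = 0.2)
)

# Add NA values to the three continuous variables
dat$gini[sample(n, 40)]   <- NA
dat$trade[sample(n, 30)]  <- NA
dat$unempl[sample(n, 35)] <- NA

# The cluster variable is converted from factor to integer
dat$country <- as.integer(dat$country)

# Start from the default predictor matrix and drop year as a predictor
pred <- make.predictorMatrix(dat)
pred[, "year"] <- 0

# Assumption: -2 goes in the *column* of the cluster variable,
# in the rows of the variables that are imputed
targets <- c("gini", "trade", "unempl")
pred[targets, "country"] <- -2

# Use 2l.pan only for the incomplete variables
meth <- make.method(dat)
meth[targets] <- "2l.pan"

imp <- mice(dat, m = 3, maxit = 5, method = meth,
            predictorMatrix = pred, seed = 123, printFlag = FALSE)

# Inspect the first completed data set
head(mice::complete(imp, 1))
```

With this setup the intercept is treated as random across countries, while trade and unempl enter as fixed effects (code 1 in the matrix); a code of 2 in a predictor column would request a random slope instead.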