r dataframe function bootstrapping emmeans

Bootstrapping emmeans derived from a multilevel regression, but "Error in t.star[r, ] <- res[[r]] : number of items to replace is not..."

I have a dataset in which participants went through 5 different situations and their behavior was measured in each situation. Here's an example dataset with a similar structure:

id1<-1:10
id<-rep(id1, each=5)
socbeh<-as.integer(rnorm(50, 3, 1))
Index1<-1:5
situation<-rep(Index1, times=10)
df<-data.frame(id, situation, socbeh)

I'm running multilevel regressions using R lme4 with participant ID as a random effect and situation as 5-level factorial predictor, like this:

library(lme4)
model<-lmer(socbeh ~ (1|id)+factor(situation), data=df)
#and then derive the estimated marginal means via emmeans
library(emmeans)
em<-emmeans(model, ~situation)

I'd like to conduct nonparametric bootstrap on the estimated marginal means and on their confidence intervals. However, all my attempts produce the error:

Error in t.star[r, ] <- res[[r]] : 
  number of items to replace is not a multiple of replacement length

I first tried the following:

fun<-function(data, idx){
     model<-lmer(socbeh ~ (1|id) + factor(situation), 
       data=data[idx,])
    rg<-ref_grid(model, mult.levs = rm_levels)
    em_<-emmeans(rg, ~situation)
}

B<-boot(df, fun, R=1000)

However, this produces the error: "Error in t.star[r, ] <- res[[r]] : number of items to replace is not a multiple of replacement length"

I tried removing all NAs from the data -> same error.

I tried perusing the advice in this response: :

df2 <- model.matrix(~socbeh + situation + id - 1, data=df)

and then ran the boot again using df2 as the data, but I still get the same error.

I also tried a simpler version of the function:

fun<-function(data, idx){
     model<-lmer(socbeh ~ (1|id) + factor(situation), 
       data=data[idx,])
       em<-emmeans(model, ~situation)
}

But got the same error again. I also tried to bootstrap just the regression coefficients by

fun<-function(data, idx) {
coef(lmer(socbeh ~ (1|id)+factor(situation), data=data[idx,]))
}
B<-boot(df2, fun, R=1000)

which has worked for me before (with continuous predictors), but now I get this error "Error in model.frame.default(data = data[idx, ], drop.unused.levels = TRUE, : variable lengths differ (found for 'factor(situation)')"

It probably goes without saying I'm very inexperienced in programming, and don't understand at all what's going on. Can anyone help? Thanks so much in advance!

Solution

Part of what emmeans() does is to reconstruct the data from the model-fitting call. Accordingly, you need the data to exist as an object in the environment where the model is fitted, and for it not to be changed between when the model was fitted and when emmeans is run. Maybe it will be enough to add a line at the beginning of the function like dat <- data[idx, ] and then use that in the model fitting. It wouldn't hurt to also add data = dat to the ref_grid call.

The other thing I wonder is what the function is supposed to return. Right now it is returning an emmGrid object, which is a pretty complicated thing. And I'm guessing that boot() is expecting a vector of estimates. If so, I think you should add a line at the end that says predict(em_) -- which will cause it to return just the EMMs.

Addendum

I tried a similar example. Definitely, the return value is an issue, and you should be returning a vector of estimates. An additional issue I found is that not all bootstrap samples have all the levels of the factor in question, and in those cases you get a different number of estimates. So you need to be extremely careful to line them up correctly.

Here is an example I got to work with the fiber data that comes with emmeans

function(data, idx) {
    dat = data[idx, ]
    fit = lm(strength ~ machine + diameter, data = dat)
    em = emmeans(fit, "machine", data = dat)
    rtn = c(A = NA, B = NA, C = NA)
    rtn[em@levels$machine] = predict(em)
    rtn
}
> boot(fiber, fun, R = 100)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = fiber, statistic = fun, R = 100)

Bootstrap Statistics :
    original      bias    std. error
t1* 40.38241 -0.18833737    1.154697
t2* 41.41922 -0.07372249    1.250100
t3* 38.79836 -0.11139516    1.305454

Note the last 3 lines of fun. First we set all the return values to NA, corresponding to the three levels of machine. Then we set those elements that we actually estimated (as found in em@levels$machine) to the estimates we got. Then we return rtn. Those lines of code ensure that 3 values are returned every time.