I am trying to produce 95% confidence intervals for a non-linear mixed-effects model using the bootstrap method described by @BenBolker here. I may be misunderstanding some of the functions used. My goal is to simulate the 95% confidence intervals for each level of the predictor (in this example, year).
Below is reproducible code using the FlexParamCurve package's dataset penguin.data. To be clear, the code provided is a modified version of the code Dr. Bolker gave in his answer to the question linked above.
library(FlexParamCurve) #also loads package 'nlme' which is needed.
library(ggplot2)
set.seed(1234)
## creating the model
fm2 <- nlme(weight ~ SSlogis(ckage, Asym, R0, lrc),
            data = penguin.data,
            fixed = list(Asym ~ year,
                         R0 ~ year,
                         lrc ~ year),
            random = Asym ~ 1,
            start = c(Asym = 1000, 0,
                      R0 = 21, 0,
                      lrc = 1, 0),
            control = list(maxIter = 100),
            na.action = na.pass)
## create a grid of x ('ckage') values to use in the predictions below
xvals.peng <- with(penguin.data, seq(min(ckage), max(ckage), length.out = 100))
nresamp <- 100
## utility function: row-wise 2.5% and 97.5% quantiles across bootstrap replicates
get_CI <- function(y, pref = "") {
  r1 <- t(apply(y, 1, quantile, c(0.025, 0.975)))
  setNames(as.data.frame(r1), paste0(pref, c("lwr", "upr")))
}
## creating the data frame to use for predictions
pengframe <- with(penguin.data, data.frame(ckage = xvals.peng))

## tried to use this for the weight predictions, and it did not work
pengframe$weight <- predict(fm2, newdata = pengframe, level = 0)
Error in eval(predvars, data, env) : object 'year' not found

This failed because the fixed-effect covariate 'year' used in the model is missing from pengframe, so predict() could not evaluate the model. That makes sense, so I tried a workaround using rbind():
## create the column (year) and its values separately, then rbind the data frames;
## there are only two levels in year
pengframe1 <- with(penguin.data, data.frame(ckage = xvals.peng))
pengframe1$year <- as.factor('2000')
pengframe2 <- with(penguin.data, data.frame(ckage = xvals.peng))
pengframe2$year <- as.factor('2002')
pengframe <- rbind(pengframe1, pengframe2)

## predicting weight for each year now works
pengframe$weight <- predict(fm2, newdata = pengframe, level = 0)
head(pengframe)
sampfun <- function(fitted, data, idvar = "bandid") {
  pp <- predict(fitted, level = 1)   # within-group (conditional) predictions
  rr <- residuals(fitted)
  dd <- data.frame(data, pred = pp, res = rr)
  ## sample groups (individuals) with replacement
  iv <- levels(data[[idvar]])
  bsamp1 <- sample(iv, size = length(iv), replace = TRUE)
  bsamp2 <- lapply(bsamp1,
                   function(x) {
                     ## within groups, sample *residuals* with replacement
                     ddb <- dd[dd[[idvar]] == x, ]
                     ## bootstrapped response = pred + bootstrapped residual
                     ## (the response here is 'weight'; the original post used 'height')
                     ddb$weight <- ddb$pred +
                       sample(ddb$res, size = nrow(ddb), replace = TRUE)
                     return(ddb)
                   })
  res <- do.call(rbind, bsamp2)  ## collect results
  if (is(data, "groupedData"))
    res <- groupedData(res, formula = formula(data))
  return(res)
}
pfun <- function(fm) {
  predict(fm, newdata = pengframe, level = 0)
}

yvals2 <- replicate(nresamp,
                    pfun(update(fm2,
                                data = sampfun(fm2,
                                               penguin.data,
                                               "bandid"))))
peng2 <- get_CI(yvals2, "boot_")
head(peng2)
pengframe <- data.frame(pengframe, peng2)
head(pengframe)
ggplot(pengframe, aes(ckage, weight, color = year)) +
  geom_smooth() + # for simplicity only; I use geom_function() with my real dataset
  geom_ribbon(data = pengframe,
              mapping = aes(x = ckage, ymin = boot_lwr, ymax = boot_upr,
                            group = year, fill = year),
              alpha = 0.3)
This gives me 95% confidence intervals for each year, evaluated over the same grid of 'ckage' values, which is my expected result.
I wanted to confirm whether this approach is statistically sound. I suspect it is okay, but I'm only beginning to get the hang of non-linear mixed models. I also wanted to ask whether there is a more direct approach, e.g. adding year inside the with() call where the 'ckage' grid (xvals.peng) is first created, to simplify the process. In my real data I will be using sex instead of year, and I will have nested random effects (1 | group/id), which is probably a different question altogether.
Looks basically fine to me. But you can build your newdata more cheaply using expand.grid(). You can also keep your workspace cleaner: you really only need one new data frame, newdata.peng, which you can expand.
> ## create newdata
> newdata.peng <- with(penguin.data,
+ expand.grid(
+ ckage=seq(min(ckage), max(ckage), length.out=100),
+ year=unique(year)
+ ))
>
>
> ## add predicted weight to newdata
> newdata.peng$weight <- predict(fm2, newdata=newdata.peng, level=0)
>
>
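(One thing to note: pfun as defined in the question predicts on pengframe. To reuse it here it would presumably need to be redefined to predict on newdata.peng, along these lines:)

## assumed redefinition of pfun so the bootstrap predicts on newdata.peng
pfun <- function(fm) {
  predict(fm, newdata = newdata.peng, level = 0)
}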
> ## bootstrap
> nresamp <- 100
> set.seed(1234)
> yvals2 <-
+ replicate(nresamp, {
+ pfun(update(fm2, data=sampfun(fm2, penguin.data, "bandid")))
+ })
Perhaps you need an order of magnitude more replications. Also, it runs forever; try parallel::parSapply, or better parallel::mclapply if you're on Linux.
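For instance, here is a rough sketch of the mclapply variant (my own adaptation, not from the original answer): it assumes fm2, sampfun and pfun are defined as above, and mclapply relies on forking, so it only parallelises on Linux/macOS.

## sketch: parallel cluster bootstrap via forking (Linux/macOS)
library(parallel)
RNGkind("L'Ecuyer-CMRG")  # reproducible RNG streams across forked workers
set.seed(1234)
yvals2 <- simplify2array(
  mclapply(seq_len(nresamp),
           function(i) pfun(update(fm2,
                                   data = sampfun(fm2, penguin.data, "bandid"))),
           mc.cores = max(1L, detectCores() - 1L))
)

simplify2array() collapses the list of prediction vectors into the same matrix shape that replicate() returns, so get_CI() can be used unchanged.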
> ## add CI weight to newdata
> newdata.peng <- cbind(newdata.peng, get_CI(yvals2, "boot_"))
I'm not sure why you want to smooth anything when plotting; you already have clean predicted values that don't need further post-processing. Here is a way to plot them without ggplot.
> ## plot
> ci_cols <- c('boot_lwr', 'weight', 'boot_upr')
> years <- sort(as.integer(as.character(unique(newdata.peng$year))))
>
> par(mar=c(4, 4, 2, 2) + .1)
> plot.new(); plot.window(xlim=range(newdata.peng$ckage),
+ ylim=range(newdata.peng[ci_cols]))
> for (i in 1:2) {
+ axis(i, axTicks(i))
+ mtext(c('chick age (days)', 'chick mass (g)')[i], i, 3)
+ }
> for (i in seq_along(years)) {
+   dd <- subset(newdata.peng, year == years[i])
+   ## plot lower CI, fit and upper CI against ckage (not against the row index)
+   matlines(dd$ckage, dd[ci_cols], col=i + 1L, lty=2:1)
+ }
> box()
> legend('bottomright', legend=c(years, '95% CI'), col=c(2:3, 8), lty=c(1, 1, 2))
Data:
> data('penguin.data', package='FlexParamCurve')
> library(nlme)
> fm2 <- nlme(weight ~ SSlogis(ckage, Asym, R0, lrc), data=penguin.data,
+ fixed=list(Asym ~ year, R0 ~ year, lrc ~ year),
+ random=Asym ~ 1, start=c(Asym=1000, 0, R0=21, 0, lrc=1, 0),
+ control=list(maxIter=100), na.action=na.pass)
Functions taken from @BenBolker's post.