R, brms: saving models to file inside a function call saves the entire local environment


I'm fitting some models in R using brms. The data are from an experiment with per-word reading times, and I want to fit the same kinds of models on data from different words, so I put the code to fit the models into a function that accepts the data to run the models on as an argument. I am saving the models to files so that I don't need to re-fit them for certain evaluations I'll be doing.

However, I've noticed that when I call the function that fits the models, the RDS files that brm saves grow larger and larger, even though the models should have the same number of parameters. I realize there will be a little variation due to the random nature of MCMC sampling, but what appears to be happening is that all of the data in the function environment at the point when the model is saved somehow ends up in the RDS file along with the model object. For instance, the first model has 11 parameters (3 fixed effects, and 1 intercept + 3 fixed effects for each of 2 random effects). This model takes up ~141 MB on disk. The second model has a different specification but exactly the same number of parameters, and it takes up ~282 MB (2 x 141 MB) on disk. The third model has, again, the same number of parameters, and it takes up ~423 MB on disk (3 x 141 MB), and so on.

Since these models take a long time to fit, I've made an MWE that shows the same behavior on a smaller dataset with fewer samples drawn (brms will complain about the ESS, but the point is that the models finish quickly so that the sizes of the saved files can be inspected).

library(brms)

fit.models <- function() {
    set.seed(0)
    m1 <- brm(
        formula = Sepal.Length ~ Sepal.Width,
        data = iris,
        cores = 4,
        chains = 4,
        iter = 100,
        file = 'm1-function.rds'
    )
    
    set.seed(0)
    m2 <- brm(
        formula = Sepal.Length ~ Petal.Length,
        data = iris,
        cores = 4,
        chains = 4,
        iter = 100,
        file = 'm2-function.rds'
    )
}

fit.models()

set.seed(0)
m1 <- brm(
    formula = Sepal.Length ~ Sepal.Width,
    data = iris,
    cores = 4,
    chains = 4,
    iter = 100,
    file = 'm1-global.rds'
)

set.seed(0)
m2 <- brm(
    formula = Sepal.Length ~ Petal.Length,
    data = iris,
    cores = 4,
    chains = 4,
    iter = 100,
    file = 'm2-global.rds'
)

Here's the result of running this on my computer: [image: file sizes]

Note that m2-function.rds is roughly twice as large as m1-function.rds, while m1-global.rds is about the same size as m2-global.rds.

I'm not sure if this is unique to brms. However, I ran a test using some simple lists of random numbers, and all the file sizes come out exactly the same (5202 KB), regardless of whether the objects were saved from within a function or from the global environment.

test <- function() {
    x <- list(runif(1e6))
    saveRDS(x, 'x-function.rds')
    
    y <- list(runif(1e6))
    saveRDS(y, 'y-function.rds')
}

test()

x <- list(runif(1e6))
saveRDS(x, 'x-global.rds')

y <- list(runif(1e6))
saveRDS(y, 'y-global.rds')

So this doesn't seem to be default behavior in R for any objects saved to RDS. Whatever it is, something brms is doing with regard to how it saves files seems to be responsible. My guess is that it has something to do with how it decides what to include from the calling environment, but I don't know how to control that.
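For what it's worth, base R does serialize an enclosing environment when the saved object carries a reference to one, which a plain list of numbers does not. The sketch below shows this with a formula created inside a function. Whether this is the mechanism at work inside brms is an assumption, not something verified here; `make_formula` and `ser_size` are hypothetical helper names used only for illustration.

```r
# Size in bytes of an object's uncompressed serialization
ser_size <- function(x) length(serialize(x, NULL))

make_formula <- function() {
    big <- runif(1e6)  # large local object that the formula never uses
    y ~ x              # a formula records the environment it was created in
}

f_local <- make_formula()
ls(environment(f_local))   # "big" is still reachable through the formula

# Same formula, but pointing at the global environment, which is
# serialized as a reference rather than by value
f_reset <- f_local
environment(f_reset) <- globalenv()

ser_size(f_local)   # on the order of 8 MB (1e6 doubles at 8 bytes each)
ser_size(f_reset)   # tiny by comparison
```

This would be consistent with the pattern in the MWE: objects created at top level point at the global environment and stay small, while objects created inside `fit.models()` could drag along everything that exists in the function at the time of saving, including earlier fits.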

In case it's not obvious, my question is the following: how can I stop this happening so the files don't take up gobs of unnecessary space? In my case, the fitted models can take up to 1 GB already in some cases, so including that in every subsequent saved model is quickly going to get out of hand.


Solution

  • I don't know if this will help or not, but I made some hacky functions for chopping out environment bits so that functions could be stored more compactly. I haven't experimented with these lately.

    This is the kind of task that the butcher package is supposed to do, but at present it doesn't have any brms methods (but the functions below might be suitable for integration there ...)

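    # S3 generic: strip bulky components from a fitted model before saving
    hack_size <- function(x, ...) {
        UseMethod("hack_size")
    }
    
    hack_size.stanfit <- function(x, ...) {
        # replace the stanmodel slot with an empty placeholder
        x@stanmodel <- structure(numeric(0), class="stanmodel")
        # replace the .MISC environment with a fresh, empty one
        x@.MISC <- new.env()
        return(x)
    }
    
    hack_size.brmsfit <- function(x, ...) {
        x$fit <- hack_size(x$fit)  # brms keeps the stanfit in $fit
        return(x)
    }
    
    hack_size.stanreg <- function(x, ...) {
        x$stanfit <- hack_size(x$stanfit)  # rstanarm keeps it in $stanfit
        return(x)
    }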
    

    After running

    saveRDS(hack_size(m1), "m1-hack.rds")
    saveRDS(hack_size(m2), "m2-hack.rds")
    

    I get

     32M Apr  3 18:43 m2-function.rds
     22M Apr  3 18:43 m1-function.rds
     11M Apr  3 18:43 m1-global.rds
     11M Apr  3 18:43 m2-global.rds
     79K Apr  3 18:46 m1-hack.rds
     77K Apr  3 18:46 m2-hack.rds
    

    I don't know exactly which functionality survives in the hacked version, but I use this in the examples for broom.mixed, so the objects are not completely crippled ...
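To find out which component of a saved object is responsible for its size, one option is to measure the serialized size of each element separately. This is only a sketch on a hypothetical stand-in object (`fake_fit`, built by the made-up `make_fake_fit`), not on a real brmsfit. Since a brmsfit is a list, the same `vapply` call should in principle work on `m1` or `m2`, but that has not been tested here.

```r
# Size in bytes of an object's uncompressed serialization.
# Unlike object.size(), this includes any environments the object refers to.
serialized_size <- function(x) as.numeric(length(serialize(x, NULL)))

# Hypothetical stand-in for a fitted model: one small component and one
# closure that drags along the environment it was created in.
make_fake_fit <- function() {
    big <- runif(1e6)  # unrelated local data
    list(
        coef    = c(a = 1, b = 2),
        predict = function(z) z  # closure defined inside this function
    )
}
fake_fit <- make_fake_fit()

# Per-component sizes point at the element that carries the environment
sizes <- vapply(fake_fit, serialized_size, numeric(1))
print(sizes)

# object.size() does not descend into environments, so it misses the problem
print(object.size(fake_fit))
```

Any component whose serialized size is far larger than `object.size()` suggests is a candidate for holding a reference to an environment, and therefore a candidate for the kind of stripping that `hack_size` does.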