
GLM object in R takes more RAM than its reported object size


I am trying to save multiple GLM objects in a list. One GLM object is trained on a large dataset, but the size of the object is reduced by setting all the unnecessary data in the GLM object to NULL. The problem is that I still run into RAM issues, because R reserves much more RAM than the size of the GLM object. Does anyone know why this problem occurs and how I can solve it? On top of this, saving the object results in a file that is larger than the object size.

Example:

> glm_full <- glm(formula = formule, data = dataset, family = binomial(), model = FALSE, y = FALSE)
> glm_full$data <- glm_full$model <- glm_full$residuals <- glm_full$fitted.values <- glm_full$effects <- glm_full$qr$qr <- glm_full$linear.predictors <- glm_full$weights <- glm_full$prior.weights <- glm_full$y <- NULL
> rm(list= ls()[!(ls() %in% c('glm_full'))])
> object.size(glm_full)
172040 bytes
> gc()
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   944802  50.5    3677981  196.5   3862545  206.3
Vcells 83600126 637.9  503881514 3844.4 629722059 4804.4
> rm(glm_full)
> gc()
          used (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  944208 50.5    2942384  157.2   3862545  206.3
Vcells 4474439 34.2  403105211 3075.5 629722059 4804.4

Here you can see that R keeps RAM reserved for the GLM object; storing multiple GLM objects in the environment therefore leads to out-of-RAM problems.


Solution

  • A rough explanation for this is that glm hides pointers to the calling environment, and to objects from that environment, deep inside the glm object (and in numerous places).

    What do you need to be able to do with your glm? Even though you've nulled out a lot of the "fat" of the model, your object size will still grow linearly with your data size, and when you compound that by storing multiple glm objects, bumping up against RAM limitations is an obvious concern.
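
    Before stripping anything, you can see where the hidden weight actually lives. Here is a minimal inspection sketch (it reuses `glm_full` and `dataset` from the question; what the environment contains depends on where the model was fit):

    e <- attr(glm_full$terms, ".Environment")  # the same environment hangs off $formula
    ls(e)                                      # often lists `dataset`, `formule`, ...
    # if the training data is reachable from here, that is what is really occupying RAM
    if (exists("dataset", envir = e, inherits = FALSE))
      print(object.size(get("dataset", envir = e)), units = "Mb")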

    Here is a function that will allow you to slice away pretty much everything that is non-essential, and the best part is that the glm object size will remain constant regardless of how large your data gets.

    stripGlmLR = function(cm) {
      # drop the large per-observation components of the fit
      cm$y = c()
      cm$model = c()
      cm$residuals = c()
      cm$fitted.values = c()
      cm$effects = c()
      cm$qr$qr = c()
      cm$linear.predictors = c()
      cm$weights = c()
      cm$prior.weights = c()
      cm$data = c()

      # drop family functions, which are closures that carry their own environments
      cm$family$variance = c()
      cm$family$dev.resids = c()
      cm$family$aic = c()
      cm$family$validmu = c()
      cm$family$simulate = c()

      # detach the environments hidden on the terms and formula;
      # these otherwise keep the training data reachable
      attr(cm$terms, ".Environment") = c()
      attr(cm$formula, ".Environment") = c()

      cm
    }
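
    A hypothetical usage sketch (the names `fit`, `big_df`, and `new_df` are illustrative, not from the question): strip a fitted model, compare sizes, and check that prediction still works.

    fit      <- glm(y ~ x1 + x2, data = big_df, family = binomial())
    fit_lean <- stripGlmLR(fit)

    print(object.size(fit),      units = "Mb")
    print(object.size(fit_lean), units = "Mb")

    # predictions on new data still work because the coefficients, terms and link are kept
    p <- predict(fit_lean, newdata = new_df, type = "response")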
    

    Some notes:

    You can null out model$family entirely and the predict function will still return its default value (so predict(model, newdata = data) will work). However, predict(model, newdata = data, type = 'response') will fail. You can recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). (I'm not sure about type = 'terms'.)
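
    For instance, a short sketch of that workaround (assuming `model` is a stripped logistic-regression fit with $family nulled out, and `newdat` holds the new observations):

    eta <- predict(model, newdata = newdat)  # linear predictor on the link scale
    p   <- 1 / (1 + exp(-eta))               # sigmoid, i.e. the inverse logit
    # equivalently: p <- plogis(eta)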

    Most importantly, anything else besides predict that you might like to do with a glm model will fail on the stripped-down version (so summary(), anova(), and step() are all a no-go). Thus, you'd be wise to extract all of this info from your glm object before running the stripGlmLR function.
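
    For example (a sketch; `fit` stands in for your fitted model), keep whatever you need in a plain list before stripping:

    fit_info <- list(
      coefficients = coef(fit),
      coef_table   = summary(fit)$coefficients,
      aic          = AIC(fit)
    )
    fit <- stripGlmLR(fit)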

    CREDIT: Nina Zumel for an awesome analysis on glm object memory allocation