
R function that returns a formula has a large memory footprint


I have a function that fits a model, and I call it many times with the same large matrix, creating a different formula inside it on each call. However, R seems to retain copies of the data I use along the way, so memory usage explodes.

Simply deleting the objects inside the function avoids this problem. However, is there a general way to avoid retaining the whole environment each time?

For example, running the following,

test <- function(X, y, rm.env=F){
  df <- cbind(y, X)
  names(df) <- c("label", paste0("X", as.character(1:ncol(X))))
  f <- formula(label~1, data=df, env=emptyenv())
  if (rm.env){
    rm(list=c("df", "X", "y"))
  }
  print(pryr::object_size(f))
  return(f)
}

X <- matrix(rnorm(700*10000), ncol=700)
y <- rnorm(10000)

m <- test(X, y)
print(pryr::object_size(m))

m <- test(X, y, rm.env=T)
print(pryr::object_size(m))

results in,

672 B
168 MB
672 B
1.13 kB

Note that the object returned by the first call has 168 MB behind it, so calling the first version over and over again eats a lot of memory fast.


Solution

  • formula(label~1, data=df, env=emptyenv()) calls the S3 method formula.formula. Let’s have a look at its code:

    stats:::formula.formula
    # function (x, ...)
    # x
    

    The extra arguments (data and env) are silently ignored!

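    In other words, your assignment is the same as if you had simply written f = label ~ 1. In particular, its associated environment is the function's local environment, not the empty environment. As long as the formula exists, that environment and everything in it (df, X and y) stays reachable and cannot be garbage collected. To fix this, you need to reset the environment manually: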

    test <- function (X, y) {
      # Use a data.frame so that names() sets the column names.
      df <- data.frame(y, X)
      names(df) <- c("label", paste0("X", seq_len(ncol(X))))
      # TODO: do something with `df` …
      f <- label ~ 1
      environment(f) <- emptyenv()
      f
    }
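
To check this yourself, here is a minimal sketch in base R (no pryr needed). The names make_formula_naive, make_formula_detached and big are made up for illustration, and big is a small stand-in for your large matrix. It shows that the env argument has no effect when formula() is given a formula, and that resetting the environment explicitly detaches the result from the function's frame.

```r
# Hypothetical helper: passes `env` to formula(), as in the question.
# formula.formula() returns its input unchanged, so `env` is dropped.
make_formula_naive <- function(big) {
  formula(label ~ 1, env = emptyenv())
}

# Hypothetical helper: resets the environment explicitly.
# `env` defaults to emptyenv(), as in the fix above.
make_formula_detached <- function(big, env = emptyenv()) {
  f <- label ~ 1
  environment(f) <- env
  f
}

big <- matrix(0, nrow = 1000, ncol = 100)  # small stand-in for the real data

f_naive    <- make_formula_naive(big)
f_detached <- make_formula_detached(big)

# The naive formula still points at the function's frame, which binds `big`.
print(identical(environment(f_naive), emptyenv()))
print(ls(environment(f_naive)))

# The detached formula no longer references that frame.
print(identical(environment(f_detached), emptyenv()))
```

One caveat, stated as an assumption about how you use the result: if the formula is later handed to a model-fitting function, the empty environment may be too bare, because functions used while evaluating the terms are looked up through the formula's environment. In that case, passing env = globalenv() or env = baseenv() to the second helper should still avoid holding on to the local frame.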