I have a function that fits a model, and I call it many times with the same big matrix (creating a different formula inside each time). However, it seems that R keeps copies of the data I use along the way, so my memory use explodes.
A simple deletion inside the function avoids this problem. However, is there a general way to avoid retaining the whole environment each time?
For example, running the following,
test <- function(X, y, rm.env=F){
  df <- cbind(y, X)
  names(df) <- c("label", paste0("X", as.character(1:ncol(X))))
  f <- formula(label~1, data=df, env=emptyenv())
  if (rm.env){
    rm(list=c("df", "X", "y"))
  }
  print(pryr::object_size(f))
  return(f)
}
X <- matrix(rnorm(700*10000), ncol=700)
y <- rnorm(10000)
m <- test(X, y)
print(pryr::object_size(m))
m <- test(X, y, rm.env=T)
print(pryr::object_size(m))
results in,
672 B
168 MB
672 B
1.13 kB
Note that the object in the first call has 168 MB behind it, so calling the first version over and over again eats a lot of memory fast.
formula(label~1, data=df, env=emptyenv()) calls the S3 method formula.formula. Let’s have a look at its code:
stats:::formula.formula
# function (x, ...)
# x
… the extra arguments are ignored!
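This can be checked directly with base R. The following is a minimal sketch; `check_env` is a hypothetical helper that exists only to provide a local environment to compare against.

```r
# Hypothetical helper, used only to create a local environment.
check_env <- function() {
  here <- environment()                      # the function's own environment
  f <- formula(label ~ 1, env = emptyenv())  # `env` is swallowed by `...`
  c(is_local = identical(environment(f), here),
    is_empty = identical(environment(f), emptyenv()))
}

print(check_env())
```

The formula should report the function's environment as its own, and not the empty environment that was passed as `env`.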
In other words, your assignment is the same as if you had simply written f = label ~ 1. In particular, its associated environment is the function's local environment, not the empty environment, and that local environment keeps df, X and y alive for as long as the formula exists. To fix this, you need to manually reset it:
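test <- function (X, y) {
  df <- cbind(y, X)
  # `df` is a matrix, so set column names; one name per column of `X`.
  colnames(df) <- c("label", paste0("X", seq_len(ncol(X))))
  # TODO: do something with `df` …
  f <- label ~ 1
  # Detach the formula from the local environment so that `df`, `X`
  # and `y` are not kept alive by the returned object.
  environment(f) <- emptyenv()
  f
}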
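To see the effect without pryr, you can inspect what each formula's environment still holds. This is a minimal sketch with a deliberately small matrix; `make_leaky` and `make_clean` are made-up names for illustration, not part of the code above.

```r
# Hypothetical helpers for illustration only.
make_leaky <- function(X) {
  df <- cbind(1, X)  # stands in for a large local object
  label ~ 1          # the formula captures this function's environment
}

make_clean <- function(X) {
  df <- cbind(1, X)
  f <- label ~ 1
  environment(f) <- emptyenv()  # drop the reference to the local environment
  f
}

X <- matrix(rnorm(20), ncol = 4)

leaky <- make_leaky(X)
clean <- make_clean(X)

# The leaky formula can still reach the function's local variables.
print(ls(environment(leaky)))

# The clean formula points at the empty environment instead.
print(identical(environment(clean), emptyenv()))
```

If the formula is later handed to a modelling function, the empty environment may turn out to be too restrictive, because such functions can look up names through the formula's environment. In that case `environment(f) <- globalenv()` is a possible alternative that still releases the local variables; whether it is needed depends on how the formula is used, which the question does not specify.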