r scoping

My call to glm in a function does not find the formula in the environment


In the following example, I create a function to fit a glm, but the function cannot find the formula defined immediately before. I believe this has to do with the function looking in the wrong environment, but I can't understand why. Here is an example:

n <- 20
ncov <- 3
df <- as.data.frame(replicate(ncov+1, runif(n)))
names(df) <- c(paste0("x", seq(ncov)), "y")
df

fun1 <- function(mod, pTrain = 0.5){
  print(environment())
  data <- mod$data
  y <- mod$y
  train <- sample(nrow(data), size = nrow(data)*pTrain)
  valid <- -train
  modTrain <- update(object = mod, data = data[train,])
  yhat <- predict(modTrain, newdata = data[valid,])
  res <- data.frame(y = y[valid], yhat = yhat) # keep only the held-out observations so y and yhat align
  return(res)
}

fun2 <- function(useCovs = c(1,0,0), data = df){
  print(environment())
  fmla <- formula(paste("y ~", paste(paste0("x", seq(useCovs))[as.logical(useCovs)], collapse = " + ")))
  # environment(fmla) <- environment() # does not help
  mod <- glm(formula = fmla, data = data)
  res <- fun1(mod, pTrain = 0.5)
  score <- sqrt(mean((res$y - res$yhat)^2))
  return(c(aic = AIC(mod), rmse = score))
}

fmla <- NULL # just to be sure there is no formula left over in the global environment
fun2(useCovs = c(1,0,1))
# Error in eval(mf, parent.frame()) : object 'fmla' not found

If I use a <<- assignment for the formula, the function works, but I worry about the potential issues with this:

fun3 <- function(useCovs = c(1,0,0), data = df){
  print(environment())
  fmla <<- formula(paste("y ~", paste(paste0("x", seq(useCovs))[as.logical(useCovs)], collapse = " + ")))
  mod <- glm(formula = fmla, data = data)
  res <- fun1(mod, pTrain = 0.5)
  score <- sqrt(mean((res$y - res$yhat)^2))
  return(c(aic = AIC(mod), rmse = score))
}

fmla <- NULL # just to be sure there is no formula left over in the global environment
fun3(useCovs = c(1,0,1)) # works
fmla # now holds the formula built inside fun3, written to the global environment by <<-

Solution

  • Inspired by this post (in particular, the answer that was not accepted), the following change to fun1 seems to solve the problem.

    fun1 <- function(mod, pTrain = 0.5){
      data <- mod$data
      y <- mod$y
      train <- sample(nrow(data), size = nrow(data)*pTrain)
      valid <- -train
      # New code
      ev <- environment()
      parent.env(ev) <- environment(mod$formula)
      environment(mod$formula) <- ev
      # End of new code
      modTrain <- update(object = mod, data = data[train,])
      yhat <- predict(modTrain, newdata = data[valid,])
      res <- data.frame(y = y[valid], yhat = yhat) # keep only the held-out observations so y and yhat align
      return(res)
    }
    

    I cannot explain why it works, though the discussion in the accepted answer to that post is probably worth some study.
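
    A possible way to see what is happening, offered as a sketch rather than a definitive account: glm stores its call unevaluated, so the model remembers the symbol it was given rather than the formula itself, and update re-evaluates the rebuilt call in the frame it was called from. The names fit_inside and local_fmla are made up for illustration and do not appear in the question.

    ```r
    set.seed(1)
    df <- data.frame(x1 = runif(20), y = runif(20))

    # Hypothetical helper: the formula lives in a variable local to this function
    fit_inside <- function(data) {
      local_fmla <- y ~ x1
      glm(formula = local_fmla, data = data)
    }

    mod <- fit_inside(df)

    # The stored call holds the symbol local_fmla, not the formula object
    print(mod$call)

    # evaluate = FALSE returns the rebuilt call without running it
    new_call <- update(mod, data = df[1:10, ], evaluate = FALSE)
    print(new_call)

    # Running the rebuilt call from here fails: local_fmla is not visible
    # from this frame, so the error message is captured instead
    refit <- tryCatch(
      update(mod, data = df[1:10, ]),
      error = function(e) conditionMessage(e)
    )
    print(refit)

    # The variable is still reachable through the formula's own environment
    exists("local_fmla", envir = environment(mod$formula), inherits = FALSE)
    ```

    If this reading is right, it would also account for the workaround above: once the frame of fun1 has environment(mod$formula) as its enclosure, the symbol can be found again when the call is re-evaluated.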

    As I mentioned in my comment, amending the signature of fun1 to

    fun1 <- function(mod, pTrain = 0.5, fmla)
    

    and the call to it in fun2 to

      res <- fun1(mod, pTrain = 0.5, fmla)
    
    

    also succeeds.
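
    Passing fmla presumably works because the extra argument makes the name fmla visible inside fun1's own frame, which is where update evaluates the rebuilt call. A variant that does not rely on the name at all is to hand update the formula object through its formula. argument. This is a minimal sketch, assuming the model was fitted with the glm defaults (so that mod$y and mod$data are present); fun1_alt, fit_model and local_fmla are hypothetical names, not part of the original code.

    ```r
    set.seed(42)
    df <- data.frame(x1 = runif(20), x3 = runif(20), y = runif(20))

    # Hypothetical variant of fun1: the refit uses the formula object itself
    fun1_alt <- function(mod, pTrain = 0.5) {
      data <- mod$data
      train <- sample(nrow(data), size = nrow(data) * pTrain)
      # formula(mod) returns the formula object, so the rebuilt call no longer
      # contains a symbol that has to be looked up in another frame
      modTrain <- update(mod, formula. = formula(mod), data = data[train, ])
      yhat <- predict(modTrain, newdata = data[-train, ])
      # compare predictions with the held-out observations only
      data.frame(y = mod$y[-train], yhat = yhat)
    }

    # Hypothetical stand-in for fun2: the formula is built in a local variable
    fit_model <- function(data) {
      local_fmla <- formula("y ~ x1 + x3")
      glm(formula = local_fmla, data = data)
    }

    res <- fun1_alt(fit_model(df))
    sqrt(mean((res$y - res$yhat)^2)) # RMSE on the held-out rows
    ```

    Compared with re-parenting the function's environment, this leaves all environments untouched and keeps the signature of fun1 unchanged.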