Tags: r, lm, r-mice

R - how to pass a formula to a with(data, lm(y ~ x)) construction


This question is closely related to "R - how to pass formula to a with(df, glm(y ~ x)) construction inside a function", but asks a broader question.

Why do these expressions work?

text_obj <- "mpg ~ cyl"
form_obj <- as.formula(text_obj)

with(mtcars, lm(mpg ~ cyl)) 
with(mtcars, lm(as.formula(text_obj))) 
lm(form_obj, data = mtcars)

But not this one?

with(mtcars, lm(form_obj))
Error in eval(predvars, data, env) : object 'mpg' not found

I would usually use the data argument, but this is not possible with the mice package. For example:

library(mice)
mtcars[5, 5] <- NA # introduce a missing value to be imputed
mtcars.imp = mice(mtcars, m = 5)

These don't work

lm(form_obj, data = mtcars.imp)
with(mtcars.imp, lm(form_obj))

but this does

with(mtcars.imp, lm(as.formula(text_obj)))

So, is it better to always call as.formula inside the model-fitting call, rather than constructing the formula first and then passing it in?


Solution

  • An important "hidden" aspect of formulas is their associated environment.

    A formula remembers the environment in which it was created. Here form_obj was created at the top level, so its environment is the global environment:

    environment(form_obj)
    # <environment: R_GlobalEnv>
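
    A minimal sketch of what that means in practice, using the built-in mtcars data. The names fit_data and msg are only for illustration. The same formula fits when data = is supplied, but fails inside with() because the columns are not in the formula's own environment:

    ```r
    form_obj <- as.formula("mpg ~ cyl")  # created outside with()

    # data = is searched first, so the formula's environment is never needed
    fit_data <- lm(form_obj, data = mtcars)

    # No data = here: lm() falls back to environment(form_obj),
    # which does not contain a variable called mpg
    msg <- tryCatch(
      with(mtcars, lm(form_obj)),
      error = function(e) conditionMessage(e)
    )
    print(msg)
    ```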
    

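    In the two with() versions that work, the formula is created inside with(), so its environment is the temporary environment that with() builds from the data. (lm(form_obj, data = mtcars) works for a different reason: variables are looked up in the data argument first, and only then in the formula's environment.) It's easiest to see this with the as.formula approach by splitting it into a few steps: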

    with(mtcars, {
      f = as.formula(text_obj)
      print(environment(f))
      lm(f)
    })
    # <environment: 0x7fbb68b08588>
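
    The same mechanism should carry over to a small wrapper function. The sketch below assumes a plain data frame; fit_from_text is a hypothetical helper, not part of any package:

    ```r
    # fit_from_text is a hypothetical helper used only for illustration
    fit_from_text <- function(data, text) {
      # as.formula() is called inside with(), so the formula's environment is
      # the temporary one built from `data`; `text` itself is found in this
      # function's frame, which encloses that temporary environment
      with(data, lm(as.formula(text)))
    }

    fit <- fit_from_text(mtcars, "mpg ~ cyl")
    coef(fit)
    ```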
    

    We can make the form_obj approach work by editing its environment before calling lm:

    with(mtcars, {
      # set form_obj's environment to the current one
      environment(form_obj) = environment()
      lm(form_obj)
    })
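
    Note that the assignment inside with() works on a local copy. A short check of this, with env_before as an illustrative name:

    ```r
    form_obj <- as.formula("mpg ~ cyl")
    env_before <- environment(form_obj)

    fit <- with(mtcars, {
      # this creates a modified copy of form_obj local to the with() call
      environment(form_obj) <- environment()
      lm(form_obj)
    })

    # the formula defined outside still points at its original environment
    identical(environment(form_obj), env_before)
    # TRUE
    ```

    So the fix has to be repeated in every with() call that uses the formula.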
    

    The help page for ?formula is a bit long, but there's a section on environments:

    Environments

    A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument.

    Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

    The upshot is that creating a formula with ~ sweeps the environment "under the rug". In more general settings, it's safer to use as.formula, which gives you fuller control over the environment the formula is tied to.
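
    A minimal sketch of the env argument, assuming the data are copied into an environment first (data_env is an illustrative name). It also shows one caveat: as.formula returns an object that is already a formula unchanged, so env only takes effect when converting from something else, such as a character string:

    ```r
    # data_env is an illustrative name: an environment holding mtcars' columns
    data_env <- list2env(as.list(mtcars), parent = globalenv())

    # For a character string, env = sets the formula's environment directly
    f <- as.formula("mpg ~ cyl", env = data_env)
    fit <- lm(f)  # no data = needed; variables come from environment(f)

    # For an existing formula, as.formula() returns it as-is and ignores env =
    g <- mpg ~ cyl
    g2 <- as.formula(g, env = data_env)
    identical(environment(g2), environment(g))
    # TRUE

    # so an existing formula's environment has to be set explicitly
    environment(g) <- data_env
    ```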

    You might also check the chapter on environments in Hadley Wickham's Advanced R:

    http://adv-r.had.co.nz/Environments.html