r mass

In the MASS::boxcox function, I often get the dreaded "'data' must be a data.frame, environment, or list" error, apparently erroneously


I'm running R 4.4.1 and MASS 7.3-61 on a 14" MacBook Pro (Nov 2023) with macOS 14.6.

Here's a minimal reproducible example:

require(MASS)

set.seed(42)
x = rnorm(5)
y = rnorm(5)

df = data.frame(x, y)

lmod = lm(y~x, data=df)
boxcox(lmod)

This produces the error:

Error in model.frame.default(formula = y ~ x, data = df, drop.unused.levels = TRUE) : 
  'data' must be a data.frame, environment, or list

The variable df is clearly a data frame, so this error message is totally erroneous:

> class(df)
[1] "data.frame"
> is.data.frame(df)
[1] TRUE

I'm obviously specifying the model correctly, so that cause is not relevant. Calling traceback() on my full script (which has an extra predictor, cat, and passes plotit = TRUE to boxcox) yields the following:

16: stop("'data' must be a data.frame, environment, or list")
15: model.frame.default(formula = y ~ x + cat, data = df, drop.unused.levels = TRUE)
14: stats::model.frame(formula = y ~ x + cat, data = df, drop.unused.levels = TRUE)
13: eval(mf, parent.frame())
12: eval(mf, parent.frame())
11: lm(formula = y ~ x + cat, data = df, y = TRUE, qr = TRUE)
10: eval(call, parent.frame())
9: eval(call, parent.frame())
8: update.default(object, y = TRUE, qr = TRUE, ...)
7: update(object, y = TRUE, qr = TRUE, ...)
6: boxcox.lm(lmod, plotit = TRUE)
5: boxcox(lmod, plotit = TRUE) at test_boxcox.R#20
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/Projects/non_repo_data/test_boxcox.R")

But going through the stats::model.frame.default function's source code did not reveal this stop command anywhere. I'm at a total loss for understanding why this is happening, or even whence the error is arising. Definitely feels like a bug, though.


Solution

  • tl;dr you have to name your data frame something other than df, so that it doesn't collide with a built-in R object (here stats::df, the density function of the F distribution).

    The error itself arises from line 526 of src/library/stats/R/models.R.

    This is arguably a bug, or at least an "infelicity" (sensu Bill Venables), in MASS::boxcox, but it is also an illustration of why it's good to avoid name overlaps between your variables and built-in objects. (I've submitted a bug report.)
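
To see the collision for yourself, a minimal sketch like the following may help. It only inspects the object that R already ships under the name df; the variable names on the left-hand side are introduced here for illustration.

```r
# In the stats package, `df` is the F-distribution density, not a data frame.
f_density_is_function <- is.function(stats::df)
f_density_is_df <- is.data.frame(stats::df)

# The function lives in the stats namespace and is exported from it.
home <- environmentName(environment(stats::df))
exported <- "df" %in% getNamespaceExports("stats")

print(c(f_density_is_function, f_density_is_df, exported))
print(home)
```

Any code that looks up df from inside a package can therefore find this function before it ever reaches a data frame of the same name in your workspace.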

    Continuing with your example:

    dff <- df  ## rename your data frame
    lmod <- lm(y~x, data=dff)
    boxcox(lmod)
    

    Error in boxcox.default(lmod) : response variable must be positive

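    This second error is unrelated to the first: the Box-Cox transformation needs a strictly positive response, and rnorm() produces negative values of y. In that respect your example was slightly inappropriate (although it was fine for showing what you wanted).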

    lmod <- lm(abs(y)~x, data=dff)
    boxcox(lmod)  ## works
    

    We can get a hint of what's going on by looking at the output of traceback():

    12: stop("'data' must be a data.frame, environment, or list")
    11: model.frame.default(formula = y ~ x, data = df, drop.unused.levels = TRUE)
    10: stats::model.frame(formula = y ~ x, data = df, drop.unused.levels = TRUE)
    9: eval(mf, parent.frame())
    8: eval(mf, parent.frame())
    7: lm(formula = y ~ x, data = df, y = TRUE, qr = TRUE)
    6: eval(call, parent.frame())
    5: eval(call, parent.frame())
    4: update.default(object, y = TRUE, qr = TRUE, ...)
    3: update(object, y = TRUE, qr = TRUE, ...)
    2: boxcox.lm(lmod)
    1: boxcox(lmod)
    

    It would take just a little more work than I feel like doing right now to establish exactly what all those parent.frame() invocations are seeing. From within the lm() call (you can get there by setting options(error = recover)), you can see that the enclosing environment of the parent frame, parent.frame()$enclos, is <environment:base>. I'm not quite sure how we get from there to <environment: namespace:stats>, which is where we're getting df from ...
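
The lookup problem can be imitated without MASS at all. The sketch below is a toy under stated assumptions, not a trace of what boxcox.lm() really does internally: refit and refit_ns are hypothetical helpers, and moving a function into the stats namespace is just a convenient way to mimic code that lives inside a package whose scope contains its own df.

```r
# Toy reproduction that does not involve MASS.
df <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = df)

# Hypothetical helper: re-fit the model through update(), the same
# generic that appears in the traceback.
refit <- function(object) update(object, y = TRUE)

# Defined next to the data, refit() resolves `df` to the data frame.
ok <- refit(fit)

# Assumption: giving the helper a namespace as its enclosing environment
# mimics a function defined inside a package. The stored call still says
# data = df, but the name is now looked up from the namespace first.
refit_ns <- refit
environment(refit_ns) <- asNamespace("stats")

msg <- tryCatch(
  {
    refit_ns(fit)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The first call succeeds and the second one fails, even though the model object and the data frame are identical in both cases. The only difference is where update() starts looking for the name df, which is why renaming the data frame sidesteps the problem.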