rformulaglmlmmodel.matrix

Check that model has only one factor covariate


I am writing an R package, where the main function takes a model, which may only have a single factor covariate (offsets are allowed). To make sure the user complies with this rule I need to check this.

As an example, let's look at the following four models:

set.seed(123)
n <- 10 

## data
data <- data.frame(y = rnorm(n), 
  trt = rep(c(0, 1), each = n/2),
  x = 1:n)
datan <- data
datan$trt <- as.factor(datan$trt)

## models
mod1 <- lm(y ~ factor(trt), data = data)
mod2 <- lm(y ~ offset(x) + as.factor(trt), data = data)
mod3 <- lm(y ~ trt, data = datan)
mod4 <- glm(y ~ trt + offset(x), data = data)
mod5 <- lm(y ~ x + as.factor(trt), data = data)

Models 1, 2 and 3 are ok, models 4 and 5 are not ok (model 4 has a non-factor variable trt, model 5 has a second covariate x).

How do I check this using R? Optimally I'd get a TRUE for a model that's ok, and a FALSE for a problematic model.

This should work not only with lm() and glm(), but also with survreg() and coxph() (from package survival). Something that might be useful is looking at the formula eval(getCall(mod1)$formula) and the data (data/ datan).


Solution

  • As indicated in the previous reply by @LAP you can use the terms() from these models. However, I would recommend to look at the attr(..., "factors") and attr(..., "dataClasses") instead of going to the $model which requires that the entire model.frame() is stored in the model. This may or may not be the case. Specifically, when re-fitting multiple models you might want to be able to not store the model frame each time.

    So one idea would be to proceed in the following steps:

    R code:

    one_factor <- function(object) {
      f <- attr(terms(object), "factors")
      if(length(f) == 0L || NCOL(f) != 1L) return(FALSE)
      d <- attr(terms(object), "dataClasses")
      if(d[colnames(f)] %in% c("ordered", "factor")) {
        return(TRUE)
      } else {
        return(FALSE)
      }
    }
    

    This appears to work well for single-part formula-based objects.

    Dummy data with numeric/factor/ordered trt:

    d1 <- d2 <- d3 <- data.frame(y = log(1:9), x = 1:9, trt = rep(1:3, each = 3)) 
    d2$trt <- factor(d2$trt)
    d3$trt <- ordered(d3$trt)
    

    Various formula specifications:

    f <- list(
      y ~ 1,
      y ~ x,
      y ~ trt,
      y ~ trt + x,
      y ~ trt + offset(x),
      y ~ trt + x + offset(x),
      y ~ trt + offset(as.numeric(trt)),
      y ~ factor(trt),
      y ~ factor(trt) + offset(x),
      y ~ factor(x > as.numeric(trt)),
      y ~ interaction(x, trt),
      y ~ 0 + trt
    )
    

    Expected results for d1, d2, and d3, respectively:

    ok1 <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
    ok2 <- c(FALSE, FALSE, TRUE,  FALSE, TRUE,  FALSE, TRUE,  TRUE, TRUE, TRUE, TRUE, TRUE)
    ok3 <- ok2
    

    Checks for lm without storing the model frame:

    lm1 <- lapply(f, lm, data = d1, model = FALSE)
    identical(sapply(lm1, one_factor), ok1)
    ## [1] TRUE
    lm2 <- lapply(f, lm, data = d2, model = FALSE)
    identical(sapply(lm2, one_factor), ok2)
    ## [1] TRUE
    lm3 <- lapply(f, lm, data = d3, model = FALSE)
    identical(sapply(lm3, one_factor), ok3)
    ## [1] TRUE
    

    Checks for survreg (Gaussian) and coxph. (The latter throws a lot of warnings about non-convergence which is not surprising given the dummy data structure. The checks still work as intended.)

    library("survival")
    d1$y <- d2$y <- d3$y <- Surv(d1$y + 0.5)
    
    sr1 <- lapply(f, survreg, data = d1)
    identical(sapply(sr1, one_factor), ok1)
    ## [1] TRUE
    sr2 <- lapply(f, survreg, data = d2)
    identical(sapply(sr2, one_factor), ok2)
    ## [1] TRUE
    sr3 <- lapply(f, survreg, data = d3)
    identical(sapply(sr3, one_factor), ok3)
    ## [1] TRUE
    
    cph1 <- lapply(f, coxph, data = d1)
    identical(sapply(cph1, one_factor), ok1)
    ## [1] TRUE
    cph2 <- lapply(f, coxph, data = d2)
    identical(sapply(cph2, one_factor), ok2)
    ## [1] TRUE
    cph3 <- lapply(f, coxph, data = d3)
    identical(sapply(cph3, one_factor), ok3)
    ## [1] TRUE
    

    Note: If you have multi-part Formula-based objects this function might fail and the underlying tests would need to be adapted. Examples for the latter could include count regression models (zeroinfl, hurdle), multinomial logit (mlogit), instrumental variables (ivreg), heteroscedastic models (vglm, betareg, crch) etc. These might have formulas like y ~ trt | 1 or y ~ trt | trt or y ~ trt | x which may or may not still be feasible in your framework.