r formula

Add interaction terms in formula


I would like to write an R function that adds interaction terms to a formula.

For instance, the function takes the formula mpg ~ cyl + gear + disp, the treatment variable cyl and a character vector of control variables c("gear","disp") and returns mpg ~ cyl + cyl * gear + cyl * disp.

Ideally, the function should return an error if one of the control variables is not in the formula, or if the interaction term is already in the formula.

I came up with the following, which seems to work but uses string manipulation rather than first principles.

I think this makes it more prone to errors and slower.

How can I re-write it to use first principles?

#' Add interaction terms in a formula
#' 
#' @param form A formula
#' @param treat The treatment variable (string)
#' @param controls A character vector of control variables
#' @return A formula with interaction terms added between `treat` and each variable in `controls`
#' @export
#' @examples
#' reformulas_addints(mpg ~ cyl + gear, "cyl", c("gear"))
#' reformulas_addints(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
#' reformulas_addints(mpg ~ cyl + gear, "cyl", c("gears"))
#' reformulas_addints(mpg ~ cyl + cyl*gear, "cyl", c("gear"))
reformulas_addints <- function(form, treat, controls) {
  # deparse1() gives one string; as.character() on a formula returns a length-3 vector
  form_str <- deparse1(form)
  for (control in controls) {
    if(!stringr::str_detect(form_str, control)){
      stop(paste0("The variable '", control, "' is not in the formula."))
    }
    patt <- paste0(r"(\s*)",treat,r"(\s*\*\s*)",control, r"(\s*)")
    if(stringr::str_detect(form_str, patt)){
      stop(paste0("The interaction '",treat, " * ", control, "' is already in the formula."))
    }
    form_str <- stringr::str_replace(
      form_str,
      paste0("\\b", control, "\\b"),
      paste0(treat, " * ", control)
    )
  }
  return(as.formula(form_str))
}

Here are some examples with expected output:

# Expected outputs 
reformulas_addints(mpg ~ cyl + gear, "cyl", c("gear"))
# mpg ~ cyl + cyl * gear
# also acceptable
# mpg ~ cyl + gear + cyl:gear
reformulas_addints(mpg ~ cyl + gear + disp, "cyl", c("gear", "disp"))
# mpg ~ cyl + cyl * gear + cyl * disp
# also acceptable
# mpg ~ cyl + gear + disp + cyl:gear + cyl:disp
reformulas_addints(mpg ~ cyl + gear + disp + hp, "cyl", c("gear", "disp"))
# mpg ~ cyl + cyl * gear + cyl * disp + hp
# also acceptable
# mpg ~ cyl + gear + disp + cyl:gear + cyl:disp + hp
# Notice that `hp` is _not_ interacted
reformulas_addints(mpg ~ cyl + gear, "cyl", c("gears"))
# Error: The variable 'gears' is not in the formula.
reformulas_addints(mpg ~ cyl + cyl*gear, "cyl", c("gear"))
# Error: The interaction 'cyl * gear' is already in the formula.

Solution

  • Use terms() to list the terms in your formula, and update() to modify it. I don't really know how low-level you want to go with "first principles", but this seems to do what you want:

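    reformulas_addints <- function(form, treat, controls) {
      stopifnot(inherits(form, "formula"))

      # Labels of the right-hand-side terms, e.g. "cyl", "gear", "disp"
      variables <- attr(terms(form), "term.labels")

      # The treatment and every control must already appear as terms
      stopifnot(controls %in% variables, treat %in% variables)

      # Append one control:treat interaction per control variable
      for (control in controls) {
        new <- as.formula(paste("~ . + ", control, ":", treat))
        form <- update(form, new)
      }

      form
    }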
    

    Here is an example:

    f <- mpg ~ cyl + gear + disp
    
    reformulas_addints(f, "cyl", c("gear", "disp"))
    #> mpg ~ cyl + gear + disp + cyl:gear + cyl:disp
    

    Created on 2025-08-29 with reprex v2.1.1
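
The function above stops with a generic stopifnot() message when a variable is missing, and it silently returns the formula unchanged when the interaction is already present (update() just merges the duplicate term). The question asked for a specific error in both cases. A minimal sketch of one way to add them follows. The name add_interactions is hypothetical, and the code assumes plain, syntactic variable names (no backticks, no transformations such as log(x)).

```r
# Sketch only: `add_interactions` is a hypothetical name, not from the post.
# Assumes `treat` and `controls` are plain, syntactic variable names.
add_interactions <- function(form, treat, controls) {
  stopifnot(inherits(form, "formula"))

  # Term labels of the right-hand side, e.g. "cyl", "gear", "cyl:gear"
  labels <- attr(terms(form), "term.labels")

  # Every variable must already be a main-effect term
  for (v in c(treat, controls)) {
    if (!v %in% labels) {
      stop("The variable '", v, "' is not in the formula.", call. = FALSE)
    }
  }

  # An existing interaction may be labelled in either order
  for (control in controls) {
    existing <- c(paste0(treat, ":", control), paste0(control, ":", treat))
    if (any(existing %in% labels)) {
      stop("The interaction '", treat, " * ", control,
           "' is already in the formula.", call. = FALSE)
    }
  }

  # Build "~ . + treat:c1 + treat:c2 + ..." and apply it in one update()
  new_terms <- paste0(treat, ":", controls)
  update(form, reformulate(c(".", new_terms)))
}

add_interactions(mpg ~ cyl + gear + disp + hp, "cyl", c("gear", "disp"))
#> mpg ~ cyl + gear + disp + hp + cyl:gear + cyl:disp
```

Called with c("gears") on mpg ~ cyl + gear, this sketch stops with "The variable 'gears' is not in the formula."; called with c("gear") on mpg ~ cyl + cyl*gear, it stops with "The interaction 'cyl * gear' is already in the formula.". Because the checks compare whole term labels rather than substrings, a control named gear is not confused with a term such as gearbox. Note that update() places the interactions after all main effects, so hp appears before cyl:gear in the result.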