Between in dplyr with lm function


I am testing for outliers using the iris dataset:

mod <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)

I use rstudent() to calculate the studentized residuals and add an indicator for whether each value falls outside the range [-2, 2].

iris2 <-
  iris |> 
  mutate(res_stud = rstudent(mod),
         res_stud_large = as.numeric(!between(res_stud, -2, 2)))

but I get this error:

Error in `mutate()`:
ℹ In argument: `res_stud_large = as.numeric(!between(res_stud, -2, 2))`.
Caused by error:
! length(g) must match nrow(X)
Backtrace:
  1. dplyr::mutate(...)
 13. base::stop(`<Rcpp::xc>`)

I checked the structure of the residuals:

str(rstudent(mod))

 Named num [1:150] -0.0113 -1.2776 0.0609 -0.0142 0.6545 ...
 - attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...

Is the fact that this is a named vector the reason I get this error?

I tried using the subset function, but without success.


Solution

  • I think there may be something else going on here. Using just dplyr and the iris data, it works.

    library(dplyr)
    mod <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)
    iris2 <-
      iris |> 
      mutate(res_stud = rstudent(mod),
             res_stud_large = as.numeric(!between(res_stud, -2, 2)))
    

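    This works because the iris data are complete (no NA values). If we impose a missing value, you'll see that it fails with a length mismatch much like the one in your example, because the model drops the incomplete row and rstudent() returns 149 values for a 150-row data frame: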

    iris$Species[1] <- NA
    
    mod <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)
    iris2 <-
      iris |> 
      mutate(res_stud = rstudent(mod),
             res_stud_large = as.numeric(!between(res_stud, -2, 2)))
    #> Error in `mutate()`:
    #> ℹ In argument: `res_stud = rstudent(mod)`.
    #> Caused by error:
    #> ! `res_stud` must be size 150 or 1, not 149.
    

    If you estimate the model with na.action = na.exclude, then functions that return per-observation values, such as fitted values or residuals, pad their output with NA for the cases that were not used in the analysis. That makes the output the same length as the original input.

    mod2 <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris, 
               na.action = na.exclude)
    iris2 <- iris |> 
      mutate(res_stud = rstudent(mod2),
             res_stud_large = as.numeric(!between(res_stud, -2, 2)))
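
    To make the difference between the two fits concrete, here is a minimal sketch that compares the length of rstudent() under the default na.action (na.omit, assuming the default options) and under na.exclude. The names iris_na, mod_omit and mod_excl are just illustrative; it works on a copy so the built-in iris stays untouched.

    ```r
    # Work on a copy of iris and impose one missing value.
    iris_na <- iris
    iris_na$Species[1] <- NA

    # Default na.action (na.omit): the incomplete row is dropped silently.
    mod_omit <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris_na)

    # na.exclude: the row is still left out of the fit, but per-observation
    # outputs are padded with NA in its place.
    mod_excl <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris_na,
                   na.action = na.exclude)

    length(rstudent(mod_omit))   # one shorter than nrow(iris_na)
    length(rstudent(mod_excl))   # same as nrow(iris_na)
    head(rstudent(mod_excl), 3)  # first element is NA
    ```

    Only the second vector lines up row by row with the data frame, which is what mutate() needs.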
    

    I wonder if something like this happened along the way that wasn't documented in your example?
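
    If missing values turn out not to be the cause, another thing worth ruling out (this is only a guess; nothing in the question confirms it) is that some other attached package defines its own between() and masks dplyr's. A rough way to check, using base R's find() and an explicit dplyr:: prefix:

    ```r
    # List every attached package or environment that defines `between`.
    # Anything listed before "package:dplyr" would mask dplyr's version.
    hits <- find("between")
    print(hits)

    # Calling the function with an explicit namespace sidesteps any masking.
    # Guarded so the snippet still runs if dplyr is not installed.
    if (requireNamespace("dplyr", quietly = TRUE)) {
      res <- c(-2.5, 0, 3)  # made-up residuals, for illustration only
      print(as.numeric(!dplyr::between(res, -2, 2)))
    }
    ```

    If find() reports more than one location, writing dplyr::between() inside the mutate() call would remove the ambiguity.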

    Created on 2025-03-16 with reprex v2.1.1.9000