r statistics linear-regression

Is this a mistake in R? To my understanding the output should be the same


Recently, while comparing two statistics exercises, I found that R gives different outputs for what I believe is the same input. Is this unintended behavior in R?

model1 <- lm(rent ~ area + bath, data = rent99)
coefficients1 <- coef(model1)

# Using a matrix without an intercept column
X <- cbind(rent99$area, rent99$bath)
model2 <- lm(rent99$rent ~ X[, 1] + X[, 2])
coefficients2 <- coef(model2)

# Both coefficients1 and coefficients2 should be identical
coefficients1
coefficients2

Output:

(Intercept)        area       bath1 
 144.149195    4.587025  100.661413 
(Intercept)      X[, 1]      X[, 2] 
  43.487782    4.587025  100.661413

I would expect the coefficients to be identical, because the input data is identical.


Solution

  • bath is a factor variable.

    Let's reproduce:

    set.seed(42)
    x <- sample(0:1, 100, TRUE)
    DF <- data.frame(x = factor(x),
                     y = 0.1 + 5 * x + rnorm(100))
    
    coef(lm(y ~ x, data = DF))
    #(Intercept)          x1 
    # 0.03815139  5.06531032 
    
    coef(lm(DF$y ~ cbind(DF$x)))
    #(Intercept) cbind(DF$x) 
    #-5.027159    5.065310 
    

    The issue is your use of cbind. It produces a matrix, and a matrix can hold only one data type and cannot hold S3 objects such as factors, so the factor is reduced to its underlying integer codes.

    Thus, cbind works like as.numeric in your example:

    as.numeric(DF$x)
    #  [1] 1 1 1 1 2 2 2 2 1 2 1 2 1 2 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 2 2
    # [76] 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 2 1 2 2 2 1 2 2 2
    

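    As you can see, that returns the internal integer codes of the factor variable. In effect, you recoded the variable from 0/1 to 1/2. Shifting a predictor by 1 leaves its slope unchanged and lowers the intercept by one slope, which is why the second intercept is 144.149195 - 1 * 100.661413 = 43.487782.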
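
If the goal is for the matrix-based fit to match the formula-based one, the factor has to be turned back into its 0/1 values before it goes into cbind. The following is a minimal sketch on the simulated data from above, not on rent99; the names x_num, X1, X2, fit_factor, fit_numeric and fit_mm are illustrative only. It assumes the factor levels are the strings "0" and "1", so that as.numeric(as.character(...)) recovers the original values.

```r
# Simulated data, mirroring the reproduction above
set.seed(42)
x <- sample(0:1, 100, TRUE)
DF <- data.frame(x = factor(x),
                 y = 0.1 + 5 * x + rnorm(100))

# Reference fit: the formula interface dummy-codes the factor as 0/1
fit_factor <- coef(lm(y ~ x, data = DF))

# Option 1: recover the original 0/1 values before building the matrix
# (assumes the levels are the strings "0" and "1")
x_num <- as.numeric(as.character(DF$x))
X1 <- cbind(x_num)
fit_numeric <- coef(lm(DF$y ~ X1))

# Option 2: let model.matrix() do the dummy coding,
# then drop its intercept column because lm() adds its own
X2 <- model.matrix(~ x, data = DF)[, -1, drop = FALSE]
fit_mm <- coef(lm(DF$y ~ X2))

# Coefficient names differ between the fits, so compare values only
all.equal(unname(fit_factor), unname(fit_numeric))
all.equal(unname(fit_factor), unname(fit_mm))
```

For the data in the question, the same idea would presumably mean building the matrix from as.numeric(as.character(rent99$bath)) rather than from rent99$bath directly. That only works if the levels of bath really are "0" and "1", which the coefficient name bath1 suggests but does not prove.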