Recently, I was comparing two statistics exercises and found that R gives different outputs for what looks like the same input. Is this unintended behavior of R?
model1 <- lm(rent ~ area + bath, data = rent99)
coefficients1 <- coef(model1)
# Using a matrix without an intercept column
X <- cbind(rent99$area, rent99$bath)
model2 <- lm(rent99$rent ~ X[, 1] + X[, 2])
coefficients2 <- coef(model2)
# Both coefficients1 and coefficients2 should be identical
coefficients1
coefficients2
Output:
(Intercept)        area       bath1
 144.149195    4.587025  100.661413
(Intercept)      X[, 1]      X[, 2]
  43.487782    4.587025  100.661413
I would expect the coefficients to be identical, because the input data is identical.
bath is a factor variable.
Let's reproduce:
set.seed(42)
x <- sample(0:1, 100, TRUE)
DF <- data.frame(x = factor(x),
y = 0.1 + 5 * x + rnorm(100))
coef(lm(y ~ x, data = DF))
#(Intercept) x1
# 0.03815139 5.06531032
coef(lm(DF$y ~ cbind(DF$x)))
#(Intercept) cbind(DF$x)
#-5.027159 5.065310
The issue is your use of cbind. It produces a matrix, and a matrix can hold only one data type and cannot hold S3 objects (such as a factor). Thus, cbind works like as.numeric in your example:
as.numeric(DF$x)
# [1] 1 1 1 1 2 2 2 2 1 2 1 2 1 2 1 1 2 2 2 2 1 1 1 1 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 2 2
# [76] 2 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 2 1 2 2 2 1 2 2 2
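As you see, that returns the internal integer codes of the factor variable. Basically, you recoded that variable from 0/1 to 1/2. The slope is unaffected by this shift, but the intercept now refers to a predictor value of 0 instead of 1. That's why the second intercept is 144.149195 - 1 * 100.661413 = 43.487782.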
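To make this concrete, here is a minimal sketch that continues the reproducible example: it checks that cbind drops the factor class, confirms the intercept shift, and shows two possible ways to get matching coefficients from a matrix. The object names (m, x01, X, fit_*) are only illustrative, and the as.numeric(as.character(...)) route assumes the factor levels are the strings "0" and "1", as they are in this simulated data.

```r
set.seed(42)
x <- sample(0:1, 100, TRUE)
DF <- data.frame(x = factor(x),
                 y = 0.1 + 5 * x + rnorm(100))

# cbind() keeps only the integer codes of the factor; the class is lost
m <- cbind(DF$x)
is.factor(m)   # FALSE
range(m)

fit_factor <- coef(lm(y ~ x, data = DF))
fit_codes  <- coef(lm(DF$y ~ m))

# Shifting the predictor by +1 lowers the intercept by exactly one slope
unname(fit_factor[1] - fit_factor[2])
unname(fit_codes[1])

# Option 1 (assumes levels "0"/"1"): recover the original values first
x01 <- as.numeric(as.character(DF$x))
fit_x01 <- coef(lm(DF$y ~ cbind(x01)))

# Option 2: let model.matrix() build the dummy column from the factor;
# X already contains an intercept column, so remove lm()'s own intercept
X <- model.matrix(~ x, data = DF)
fit_mm <- coef(lm(DF$y ~ X - 1))

all.equal(unname(fit_factor), unname(fit_x01))
all.equal(unname(fit_factor), unname(fit_mm))
```

For the original question, the same idea would presumably be to build the matrix from as.numeric(as.character(rent99$bath)) (if the levels of bath are "0" and "1") or from model.matrix(~ area + bath, data = rent99) instead of passing the factor to cbind.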