I'm trying to do some model fitting for the mtcars dataset. I fitted a model with only transmission included, which gives me this:
Since the adjusted R^2 is only 0.338, I was looking for another model.
To do this, I started by fitting all possible models with just one variable (mpg ~ wt, mpg ~ cyl, mpg ~ hp, ...) and took the model with the highest adjusted R^2, which turns out to be mpg ~ wt.
Then I fitted all possible models with wt + one other variable and took the one with the highest adjusted R^2 whose p-values did not get worse. This is the model mpg ~ wt + cyl.
Then I fitted all possible models with wt + cyl + one other variable and looked for the one with the highest adjusted R^2 whose p-values did not get worse. I found that there is no such model: there is mpg ~ wt + cyl + hp, but its p-value is higher than that of mpg ~ wt + cyl.
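In code, that first step was roughly this (just a sketch; I actually did it by hand, and the later steps repeated it with wt, then wt + cyl, held fixed):
# Fit mpg against each single predictor and rank the models by adjusted R^2
predictors <- setdiff(names(mtcars), "mpg")
adj_r2 <- sapply(predictors, function(v) {
  summary(lm(reformulate(v, response = "mpg"), data = mtcars))$adj.r.squared
})
sort(adj_r2, decreasing = TRUE)  # wt comes out on top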
So, here comes my problem:
When I now include am to see if there is a difference, everything changes: adjusted R^2 gets better; p gets worse. But all of a sudden cyl is not significant any more (its Pr changed from 0.000222 to 0.2119). From the coefficients output I would clearly exclude am from the list, since its probability of not being significant is 31.42%.
What is going on here? From my very first model (mpg ~ am) I concluded there should be some significance of am in the mtcars dataset.
The inherent risk of adding more terms to your model is collinearity of the independent variables. Linear regression assumes that the independent variables are, as the name suggests, independent of each other.
As you described, a model using wt and cyl looks like this:
library(rms)    # provides vif(), used below
library(broom)  # provides tidy()
fit0 <- lm(mpg ~ wt + cyl, data = mtcars)
tidy(fit0)
term estimate std.error statistic p.value
1 (Intercept) 39.686261 1.7149840 23.140893 3.043182e-20
2 wt -3.190972 0.7569065 -4.215808 2.220200e-04
3 cyl -1.507795 0.4146883 -3.635972 1.064282e-03
And the model that adds am:
fit1 <- lm(mpg ~ wt + cyl + am, data = mtcars)
tidy(fit1)
term estimate std.error statistic p.value
1 (Intercept) 39.4179334 2.6414573 14.9227979 7.424998e-15
2 wt -3.1251422 0.9108827 -3.4308942 1.885894e-03
3 cyl -1.5102457 0.4222792 -3.5764148 1.291605e-03
4 am 0.1764932 1.3044515 0.1353007 8.933421e-01
Comparing the two, the coefficient of the cyl variable changes from -1.507 to -1.510; not a big change. The standard error changes from 0.414 to 0.422; not a big change. And while the p-value did get larger, it wasn't by a whole lot.
The model you display that actually changes a few things also includes hp. Let's look at this model:
fit2 <- lm(mpg ~ wt + cyl + am + hp, data = mtcars)
tidy(fit2)
term estimate std.error statistic p.value
1 (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
2 wt -2.60648071 0.91983749 -2.833632 8.603218e-03
3 cyl -0.74515702 0.58278741 -1.278609 2.119166e-01
4 am 1.47804771 1.44114927 1.025603 3.141799e-01
5 hp -0.02495106 0.01364614 -1.828433 7.855337e-02
In this case, the magnitude of the cyl coefficient got smaller, and its standard error increased from 0.422 to 0.582. To compare, the standard error of am only went from 1.304 to 1.441; and to contrast, the standard error of wt only went from 0.910 to 0.919 (excuse my poor rounding). You should notice that the p-value for wt didn't change much, but the p-values for cyl and am are much larger when you include hp.
This suggests that there is some form of correlation between some of the independent variables. In other words, the independent variables aren't truly independent. The result is that the standard errors of the correlated variables become inflated. Since t = estimate / std.error, a larger standard error results in a smaller t-value, which results in a larger p-value.
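As a quick sanity check of that relationship, you can recompute the t statistic for cyl in fit2 from the table above (a small sketch):
# t statistic = estimate / std.error; for cyl this reproduces the
# 'statistic' column above (about -1.28)
co <- tidy(fit2)
co$estimate[co$term == "cyl"] / co$std.error[co$term == "cyl"]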
When building models, you should remember that the model assumes independence between the predictor variables. One good way to look at this is the Variance Inflation Factor (VIF). For our models, we get the following:
vif(fit0)
wt cyl
2.579312 2.579312
vif(fit1)
wt cyl am
3.609011 2.584066 1.924955
vif(fit2)
wt cyl am hp
3.988305 5.333685 2.546159 4.310029
You'll notice that when we add hp, the VIF doubles for the cyl variable. As it turns out, having more cylinders in the engine increases the horsepower. Including these two variables violates the independence assumption.
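You can see that relationship directly in the data with a quick correlation check (just a sketch):
# Pairwise correlations among the predictors; cyl and hp are strongly
# correlated (around 0.8), which is what inflates their VIFs
cor(mtcars[, c("wt", "cyl", "hp", "am")])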
If we are trying to build a model with these variables, we would be wise to compare the models mpg ~ wt + am + cyl and mpg ~ wt + am + hp. It turns out that the model with hp has a slightly better R-squared value (and lower AIC), and is probably the better model (a quick check is sketched below). That's hard to see, because when you build your model sequentially, as you have done, cyl seems to be preferable as the second variable to add. But if you include hp, the combination of hp and am has better properties than the combination of cyl and am. This is why automated methods such as stepAIC and random forest are so popular; they can explore a lot of these nuances rather quickly.
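For completeness, here is a minimal sketch of that comparison, plus a stepAIC run over the same candidate set (fit_cyl, fit_hp, and full are just illustrative names):
library(MASS)  # provides stepAIC()

# The two candidate models discussed above
fit_cyl <- lm(mpg ~ wt + am + cyl, data = mtcars)
fit_hp  <- lm(mpg ~ wt + am + hp,  data = mtcars)

# Compare them directly on AIC and adjusted R^2
AIC(fit_cyl, fit_hp)
c(cyl = summary(fit_cyl)$adj.r.squared, hp = summary(fit_hp)$adj.r.squared)

# Or let stepAIC search over the full candidate set automatically
full <- lm(mpg ~ wt + cyl + hp + am, data = mtcars)
stepAIC(full, direction = "both", trace = FALSE)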
One additional note: using cyl as a numeric variable is probably not going to fit a realistic model. cyl only takes three values, 4, 6, and 8; 3-, 5-, and 7-cylinder engines are pretty rare, so cyl is probably better treated as a factor. Sometimes this can have an effect on your model fit (though not so much in this particular case).
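For instance, a sketch of the last model with cyl treated as a factor:
# Same predictors as fit2, but with cyl as a categorical variable
fit2_factor <- lm(mpg ~ wt + factor(cyl) + am + hp, data = mtcars)
tidy(fit2_factor)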