r model-comparison

R mtcars dataset model selection - model changes dramatically when including am


I'm trying to do some model fitting for the mtcars dataset. I started with a model that includes only the transmission variable (mpg ~ am).

Since the adjusted R^2 is only 0.338, I started looking for a better model.

To do this, I started by fitting all possible models with just one variable (mpg ~ wt, mpg ~ cyl, mpg ~ hp, ...) and took the model with the highest adjusted R^2, which turns out to be mpg ~ wt.

Then I fitted all possible models with wt plus one other variable and took the one with the highest adjusted R^2 where the p-value did not get larger. This is mpg ~ wt + cyl.

Then I fitted all possible models with wt + cyl plus one other variable and looked for the one with the highest adjusted R^2 where the p-value did not get larger. I found no such model. (There is mpg ~ wt + cyl + hp, but its p-value is higher than that of mpg ~ wt + cyl.)
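For reference, here is a sketch of my first selection step (this is my own illustration, not exactly the code I ran): fit mpg against each remaining variable on its own and rank the candidates by adjusted R^2.

```r
# Forward-selection step 1 (sketch): one single-variable model per predictor,
# ranked by adjusted R-squared. wt comes out on top, as described above.
candidates <- setdiff(names(mtcars), "mpg")
adj_r2 <- sapply(candidates, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})
sort(adj_r2, decreasing = TRUE)
```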

So, here comes my problem:

When I now include am to see if there is a difference, everything changes: the adjusted R^2 gets better, but p gets worse, and all of a sudden cyl is not significant any more (its Pr(>|t|) changed from 0.000222 to 0.2119). From the coefficients output I would clearly exclude am, since its p-value is 31.42%.

What is going on here? From my very first model (mpg ~ am) I concluded that am should have some significance in the mtcars dataset.


Solution

  • The inherent risk of adding more terms to your model is collinearity among the independent variables. Linear regression assumes that the independent variables are, as the name suggests, independent of each other.

    As you described, a model using wt and cyl looks like this

    library(rms)    # provides vif()
    library(broom)  # provides tidy()
    
    fit0 <- lm(mpg ~ wt + cyl, data = mtcars)
    tidy(fit0)
    
             term  estimate std.error statistic      p.value
    1 (Intercept) 39.686261 1.7149840 23.140893 3.043182e-20
    2          wt -3.190972 0.7569065 -4.215808 2.220200e-04
    3         cyl -1.507795 0.4146883 -3.635972 1.064282e-03
    

    And the model that adds am

    fit1 <- lm(mpg ~ wt + cyl + am, data = mtcars)
    tidy(fit1)
    
             term   estimate std.error  statistic      p.value
    1 (Intercept) 39.4179334 2.6414573 14.9227979 7.424998e-15
    2          wt -3.1251422 0.9108827 -3.4308942 1.885894e-03
    3         cyl -1.5102457 0.4222792 -3.5764148 1.291605e-03
    4          am  0.1764932 1.3044515  0.1353007 8.933421e-01
    

    Comparing the two, the coefficient of the cyl variable changes from -1.507 to -1.510; not a big change. The standard error changed from 0.414 to 0.422; not a big change. And while the p-value did get larger, it wasn't by a whole lot.

    The model you display that actually changes a few things also includes hp. Let's look at this model:

    fit2 <- lm(mpg ~ wt + cyl + am + hp, data = mtcars)
    tidy(fit2)
    
             term    estimate  std.error statistic      p.value
    1 (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
    2          wt -2.60648071 0.91983749 -2.833632 8.603218e-03
    3         cyl -0.74515702 0.58278741 -1.278609 2.119166e-01
    4          am  1.47804771 1.44114927  1.025603 3.141799e-01
    5          hp -0.02495106 0.01364614 -1.828433 7.855337e-02
    

    In this case, the magnitude of the cyl coefficient got smaller, and its standard error increased from 0.422 to 0.582. To compare, the standard error of am only went from 1.304 to 1.441; and to contrast, the standard error of wt only went from 0.911 to 0.920. You should notice that the p-value for wt didn't change much, but the p-values for cyl and am are much larger when you include hp.

    This suggests that there is some form of correlation between some of the independent variables. In other words, the independent variables aren't truly independent. The result is that the standard errors for the associated variables become inflated. Since t = estimate / std.error, a larger standard error results in a smaller t-value, which results in a larger p-value.
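    As a quick sanity check (my own sketch, refitting the same model shown above), the t-statistics in these tables are exactly estimate / std.error:

    ```r
    library(broom)
    # Refit the four-variable model from above and confirm that the
    # statistic column is estimate divided by std.error.
    fit2 <- lm(mpg ~ wt + cyl + am + hp, data = mtcars)
    td <- tidy(fit2)
    all.equal(td$statistic, td$estimate / td$std.error)  # TRUE
    ```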

    When building models, you should remember that the model assumes independence between the predictor variables. One good way to check this is the Variance Inflation Factor (VIF). For our models, we get the following

    vif(fit0)
          wt      cyl 
    2.579312 2.579312 
    vif(fit1)
          wt      cyl       am 
    3.609011 2.584066 1.924955 
    vif(fit2)
          wt      cyl       am       hp 
    3.988305 5.333685 2.546159 4.310029 
    

    You'll notice that when we add hp, the VIF roughly doubles for the cyl variable. As it turns out, having more cylinders in the engine increases the horsepower. Including both of these variables violates the independence assumption.
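    Both claims are easy to verify directly (my own sketch): the correlation between cyl and hp, and the textbook VIF formula 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others.

    ```r
    # cyl and hp are strongly correlated in mtcars (about 0.83).
    cor(mtcars$cyl, mtcars$hp)

    # A VIF can also be computed by hand: regress one predictor on the
    # others and take 1 / (1 - R^2). For cyl in fit2 this reproduces the
    # value of about 5.33 shown above.
    r2_cyl <- summary(lm(cyl ~ wt + am + hp, data = mtcars))$r.squared
    1 / (1 - r2_cyl)
    ```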

    If we are trying to build a model with these variables, we would be wise to compare the models where mpg ~ wt + am + cyl and mpg ~ wt + am + hp. It turns out that the model with hp has a slightly better R-squared value (and lower AIC), and is probably the better model. That's hard to see, because when you build your model sequentially as you have done, cyl seems to be preferable as the second variable to add. But if you include hp, the combination of hp and am has better properties than the combination of cyl and am. This is why automated methods such as stepAIC and random forest are so popular; they can explore a lot of these nuances rather quickly.
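    A sketch of that comparison (my own code, fitting the two candidate models named above and comparing adjusted R-squared and AIC):

    ```r
    # The two three-variable candidates: swap cyl for hp and compare.
    fit_cyl <- lm(mpg ~ wt + am + cyl, data = mtcars)
    fit_hp  <- lm(mpg ~ wt + am + hp,  data = mtcars)

    c(cyl = summary(fit_cyl)$adj.r.squared,
      hp  = summary(fit_hp)$adj.r.squared)
    AIC(fit_cyl, fit_hp)  # the hp model has the lower AIC
    ```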

    One additional note: using cyl as a numeric variable is probably not going to give a realistic model. In mtcars, cyl only takes three values: 4, 6, and 8. Since 3-, 5-, and 7-cylinder engines are pretty rare, cyl is probably better treated as a factor. Sometimes this can have an effect on your model fit (though not so much in this particular case).
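    For completeness, a sketch of the factor version (my own code). With three levels, factor(cyl) contributes two dummy coefficients instead of a single slope:

    ```r
    # Treat cyl as a categorical variable: the model estimates separate
    # offsets for 6- and 8-cylinder cars relative to 4-cylinder cars.
    fit_factor <- lm(mpg ~ wt + factor(cyl), data = mtcars)
    coef(fit_factor)  # intercept, wt, factor(cyl)6, factor(cyl)8
    summary(fit_factor)$adj.r.squared
    ```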