
Variable selection in Random Forest and prediction accuracy

I have a cross-sectional data set observed in two years, 2009 and 2010. I train a Random Forest for a regression problem on the first year (2009) and use the second year (2010) as the test set.

Load the data

library(randomForest)
df <- read.csv("https://www.dropbox.com/s/t4iirnel5kqgv34/df.cv?dl=1")

After training the Random Forest on the 2009 data, the variable importance ranks x1 as the most important variable.

Random Forest using all variables

rf2009 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                         data = df[df$year==2009,], 
                         mtry = 6,
                         importance = TRUE)

 randomForest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 6, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 5208746
                    % Var explained: 75.59

Variable importance

imp.all <- as.data.frame(sort(importance(rf2009)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.all) <- "% Inc MSE"

   % Inc MSE
x1 35.857840
x2 16.693059
x3 15.745721
x4 15.105710
x5  9.002924
x6  6.160413

I then predict on the test set and obtain the following accuracy metrics.

Prediction and evaluation on the test set

test.pred.all <- predict(rf2009,df[df$year==2010,])
RMSE.forest.all <- sqrt(mean((test.pred.all-df[df$year==2010,]$y)^2))
[1] 2258.041

MAE.forest.all <- mean(abs(test.pred.all-df[df$year==2010,]$y))
[1] 299.0751

When I then train the model without x1, the most important variable according to the ranking above, and apply it to the test set, I observe the following:

Random Forest excluding x1

rf2009nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                       data = df[df$year==2009,], 
                       mtry = 5,
                       importance = TRUE)

 randomForest(formula = y ~ x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 5, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 5

          Mean of squared residuals: 6158161
                    % Var explained: 71.14

Variable importance

imp.nox1 <- as.data.frame(sort(importance(rf2009nox1)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.nox1) <- "% Inc MSE"

   % Inc MSE
x2 37.369704
x4 11.817910
x3 11.559375
x5  5.878555
x6  5.533794

Prediction and evaluation on the test set

test.pred.nox1 <- predict(rf2009nox1,df[df$year==2010,])
RMSE.forest.nox1 <- sqrt(mean((test.pred.nox1-df[df$year==2010,]$y)^2))
[1] 1885.462

MAE.forest.nox1 <- mean(abs(test.pred.nox1-df[df$year==2010,]$y))
[1] 302.3382

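I am aware that the variable importance is computed from the training data and not from the test data. Still, the test RMSE is lower without x1 (1885.462 vs. 2258.041), while the MAE is slightly higher (302.3382 vs. 299.0751). Does this mean that x1 should not be included in the model?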

So, should I include x1 in the model?


  • I think you need more information about the performance of the model. With only one test set, one can only speculate about why the RMSE is better without x1 even though x1 has the highest importance. It could be due to correlation between the variables, or to the model fitting noise in the training set.

    To get more information, I would recommend looking at the out-of-bag error and doing hyperparameter optimization with cross-validation. If you see the same behavior on different test sets, you could run cross-validation with and without x1.
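A minimal sketch of that comparison, under some assumptions: `cv_rmse` is a hypothetical helper (not part of randomForest), and `sim` is simulated stand-in data so that the example runs without the Dropbox file. Only the column names (`y`, `x1` to `x6`) are taken from the question; the numbers it prints say nothing about the real data.

```r
library(randomForest)

# Hypothetical helper: k-fold cross-validated RMSE for one model formula.
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Assign every row to one of k folds at random.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    train <- data[fold != i, , drop = FALSE]
    test  <- data[fold == i, , drop = FALSE]
    fit   <- randomForest(formula, data = train, ntree = 200)
    pred  <- predict(fit, newdata = test)
    sqrt(mean((pred - test[[response]])^2))
  })
}

# Simulated stand-in data (an assumption, NOT the data from the question).
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n))
sim$y <- 2 * sim$x1 + sim$x2 + rnorm(n)

f_all  <- y ~ x1 + x2 + x3 + x4 + x5 + x6
f_nox1 <- y ~ x2 + x3 + x4 + x5 + x6

# The same seed gives the same folds, so the per-fold RMSEs are paired.
rmse_all  <- cv_rmse(f_all,  sim)
rmse_nox1 <- cv_rmse(f_nox1, sim)
print(data.frame(fold = seq_along(rmse_all), all = rmse_all, no_x1 = rmse_nox1))
print(c(mean_all = mean(rmse_all), mean_no_x1 = mean(rmse_nox1)))

# Out-of-bag MSE after the last tree, as a check that needs no held-out data.
fit_all  <- randomForest(f_all,  data = sim, ntree = 200)
fit_nox1 <- randomForest(f_nox1, data = sim, ntree = 200)
oob <- c(all = fit_all$mse[fit_all$ntree], no_x1 = fit_nox1$mse[fit_nox1$ntree])
print(oob)
```

To apply this to the question, one would pass `df[df$year == 2009, ]` instead of `sim`. If the model without x1 wins in most folds, that is stronger evidence than a single 2010 test set; if the ranking flips from fold to fold, the difference seen on the test set is probably within the noise.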