I have a cross-sectional data set repeated for two years, 2009 and 2010. I am using the first year (2009) as a training set to train a Random Forest for a regression problem, and the second year (2010) as a test set.
library(randomForest)  # needed for randomForest() below
df <- read.csv("https://www.dropbox.com/s/t4iirnel5kqgv34/df.cv?dl=1")
After training the Random Forest on the 2009 data, the variable importance indicates that x1 is the most important variable.
set.seed(89)
rf2009 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                       data = df[df$year == 2009, ],
                       ntree = 500,
                       mtry = 6,
                       importance = TRUE)
print(rf2009)
Call:
randomForest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 6, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 5208746
% Var explained: 75.59
# sort predictors by permutation importance (%IncMSE)
imp.all <- as.data.frame(sort(importance(rf2009)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.all) <- "% Inc MSE"
imp.all
% Inc MSE
x1 35.857840
x2 16.693059
x3 15.745721
x4 15.105710
x5 9.002924
x6 6.160413
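For a visual check of the same ranking, randomForest also provides varImpPlot (type = 1 restricts the plot to the %IncMSE column shown above):

# plot the permutation importance (%IncMSE) of the 2009 model
varImpPlot(rf2009, type = 1)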
I then apply the trained model to the test set and obtain the following accuracy metrics.
test.pred.all <- predict(rf2009,df[df$year==2010,])
RMSE.forest.all <- sqrt(mean((test.pred.all-df[df$year==2010,]$y)^2))
RMSE.forest.all
[1] 2258.041
MAE.forest.all <- mean(abs(test.pred.all-df[df$year==2010,]$y))
MAE.forest.all
[1] 299.0751
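For scale, the OOB error reported above can be expressed as an RMSE as well (a quick check, using the mse component that randomForest stores per tree):

# OOB RMSE of the 2009 model; the last element of $mse is the final OOB MSE
sqrt(tail(rf2009$mse, 1))  # ~2282, close to the 2010 test RMSE of 2258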
When I then train the model without the variable x1, which was the most important one as per the above, and apply the trained model to the test set, I observe the following:

- the variance explained (on the training data) is higher with x1 than without it, as expected,
- but the RMSE on the test data is better without x1 (2258.041 with x1 vs. 1885.462 without),
- nevertheless the MAE is slightly better with x1 (299.0751) than without it (302.3382).
rf2009nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                           data = df[df$year == 2009, ],
                           ntree = 500,
                           mtry = 5,
                           importance = TRUE)
print(rf2009nox1)
Call:
randomForest(formula = y ~ x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 5, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 6158161
% Var explained: 71.14
# sort the remaining predictors by permutation importance (%IncMSE)
imp.nox1 <- as.data.frame(sort(importance(rf2009nox1)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.nox1) <- "% Inc MSE"
imp.nox1
% Inc MSE
x2 37.369704
x4 11.817910
x3 11.559375
x5 5.878555
x6 5.533794
test.pred.nox1 <- predict(rf2009nox1,df[df$year==2010,])
RMSE.forest.nox1 <- sqrt(mean((test.pred.nox1-df[df$year==2010,]$y)^2))
RMSE.forest.nox1
[1] 1885.462
MAE.forest.nox1 <- mean(abs(test.pred.nox1-df[df$year==2010,]$y))
MAE.forest.nox1
[1] 302.3382
I am aware that the variable importance refers to the training model and not to the test set, but does this mean that the x1 variable should not be included in the model? So, should I include x1 in the model or not?
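One way to connect the importance of x1 directly to the test year is a test-set permutation check: shuffle x1 in the 2010 data and see how much the RMSE of the full model degrades (a rough sketch, reusing the objects defined above):

# rough test-set permutation check for x1
set.seed(89)
test2010 <- df[df$year == 2010, ]
test.perm <- test2010
test.perm$x1 <- sample(test.perm$x1)    # break the x1-y relationship
pred.perm <- predict(rf2009, test.perm)
sqrt(mean((pred.perm - test2010$y)^2))  # compare with RMSE.forest.all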
I think you need more information about the performance of the model. With only a single test set, you can only speculate about why the RMSE is better without x1 even though x1 has the highest importance. It could be correlation between the variables, or the model could be fitting noise in the training set. The fact that RMSE improves while MAE slightly worsens also suggests that the model with x1 makes a few large errors on the 2010 data, since RMSE penalizes large errors more heavily than MAE.
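As a quick first check along those lines, you could look at the pairwise correlations of the predictors in the training year (a minimal sketch, using the column names from the question):

# pairwise correlations of the predictors in the 2009 training data
round(cor(df[df$year == 2009, c("x1","x2","x3","x4","x5","x6")]), 2)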
To get more information, I would recommend looking at the out-of-bag error (the "Mean of squared residuals" in the printed output is already an OOB estimate) and doing hyperparameter optimization with cross-validation. If you see the same behavior across different test sets, you could run the cross-validation with and without x1.
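A minimal sketch of such a comparison (the fold assignment and the smaller mtry are assumptions, not tuned values; note that mtry equal to the number of predictors, as in the question, means no random feature subsetting happens at the splits):

# 5-fold CV on the 2009 data, comparing RMSE with and without x1
library(randomForest)
set.seed(89)
train <- df[df$year == 2009, ]
folds <- sample(rep(1:5, length.out = nrow(train)))  # random fold labels

cv.rmse <- function(formula, mtry) {
  sapply(1:5, function(k) {
    fit <- randomForest(formula, data = train[folds != k, ],
                        ntree = 500, mtry = mtry)
    pred <- predict(fit, train[folds == k, ])
    sqrt(mean((pred - train$y[folds == k])^2))
  })
}

mean(cv.rmse(y ~ x1 + x2 + x3 + x4 + x5 + x6, mtry = 2))  # with x1
mean(cv.rmse(y ~ x2 + x3 + x4 + x5 + x6,      mtry = 2))  # without x1

Repeating this over a grid of mtry values would cover the hyperparameter optimization part as well.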