I'm using the varImp function from the R package caret to get the importance of variables. This is my code:
library(caret)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20,
                       search = "grid", summaryFunction = youdenSumary)
classifier <- train(form = Target ~ ., data = training_set, method = "rpart",
                    parms = list(split = "information"), trControl = trctrl,
                    tuneLength = 10, metric = "j")
importance <- varImp(classifier, scale=FALSE)
This is the resulting variable importance:
rpart variable importance
Overall
nh 532.218
nRT 488.922
wdSu 482.582
av_t 390.266
nc 317.725
o 303.738
dt 291.488
wdMo 103.200
wdSa 49.690
ne 46.707
wdWe 41.642
nl 26.463
wdTu 9.506
wdTh 2.669
The code runs the recursive partitioning algorithm and keeps track of how much each split reduces the loss function. But... what is the loss function in this case? The R documentation says:
The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control. This method does not currently provide class-specific measures of importance when the response is a factor.
It mentions the mean squared error. Is this the loss function used by this package? (I'm not sure what that "e.g." in round brackets implies.)
Mean squared error is the loss function used for regression. Since you are doing classification, there are two impurity functions, Gini and information entropy; you can check the long introduction vignette for rpart for details. You specified:
parms = list(split = "information")
This means you are splitting your tree based on information entropy, so in your case the reduction refers to the reduction in information entropy. You can check the function used by caret by running:
caret:::varImpDependencies("rpart")$varImp
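For intuition, the impurity being reduced is the entropy of the class proportions in a node. A minimal sketch of the per-split gain (hypothetical helper names; note rpart additionally scales the improvement by the number of observations, so absolute numbers differ):

```r
# Entropy impurity of a node, as in rpart's split = "information" rule
# (natural log; rpart also scales the gain by node size).
node_entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]                      # avoid 0 * log(0)
  -sum(p * log(p))
}

# Entropy reduction of a candidate split: parent impurity minus the
# size-weighted impurities of the two children. `left` is a logical
# vector marking which observations go to the left child.
entropy_reduction <- function(y, left) {
  n <- length(y)
  node_entropy(y) -
    (sum(left) / n) * node_entropy(y[left]) -
    (sum(!left) / n) * node_entropy(y[!left])
}

# A perfect split of two balanced classes recovers the full entropy:
y <- factor(c("a", "a", "b", "b"))
entropy_reduction(y, c(TRUE, TRUE, FALSE, FALSE))   # log(2) ≈ 0.693
```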
It's basically summing up the improvement in information entropy per split; you can roughly check this in your case by inspecting:
classifier$finalModel$splits
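That splits matrix has one row per split (primary splits plus, by default, competing and surrogate ones), with the splitting variable as the row name and the gain in the "improve" column. A rough way to aggregate it, shown here on a standalone rpart fit on iris since your training_set isn't available (for surrogate rows that column stores agreement rather than improvement, so this only approximates varImp):

```r
library(rpart)

# Standalone example; iris stands in for the question's training_set.
fit <- rpart(Species ~ ., data = iris, parms = list(split = "information"))

sp  <- fit$splits                         # row names = splitting variables
imp <- tapply(sp[, "improve"], rownames(sp), sum)
sort(imp, decreasing = TRUE)              # Petal.* variables dominate
```

For your model, the same aggregation would start from classifier$finalModel$splits instead of fit$splits.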