Tags: r, caret, rpart

rpart variable importance shows more variables than decision tree plots


I fitted an rpart model with leave-one-out cross-validation on my data using the caret package in R. Everything works fine, but I want to understand the difference between the model's variable importance and the decision tree plot.

Calling varImp() on the model shows nine variables, but plotting the decision tree with fancyRpartPlot() or rpart.plot() shows a tree that uses only two variables to classify all subjects.
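For reference, here is a minimal sketch of my setup (my_data and Class stand in for my actual data frame and outcome column):

```r
library(caret)
library(rpart.plot)

fit <- train(Class ~ ., data = my_data,
             method = "rpart",
             trControl = trainControl(method = "LOOCV"))

varImp(fit)                 # lists nine variables
rpart.plot(fit$finalModel)  # the plotted tree splits on only two of them
```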

How can that be? Why does the decision tree plot not show the same nine variables listed in the variable importance table?

Thank you.

[Image: varImp plot]


Solution

  • Similar to rpart(), caret has a cool property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that came close to winning the competition.

    Let me be clearer. Say that at a given split the algorithm decided to split on x1, and suppose there is another variable, say x2, that would have been almost as good as x1 for splitting at that stage. We call x2 a surrogate (strictly speaking, rpart calls such close runners-up "competing" variables, and reserves "surrogate" for the splits it uses to handle missing values), and we credit it with variable importance just as we do for x1.

    This is why the importance ranking can include variables that are never actually used for splitting. You may even find that such variables rank higher than variables that are actually used!

    The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?

    To conclude, variable importance accounts for the improvement in fit from both primary variables (i.e., the ones actually chosen for splitting) and surrogate variables. So the importance of x1 reflects both the splits for which x1 is chosen as the splitting variable and the splits for which another variable is chosen but x1 is a close competitor (see the sketch at the end of this answer).

    Hope this clarifies your doubts. For more details, see here. Just a quick quotation:

    The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:

    [...]

    - Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.

    I am not very familiar with caret, but from this quote it appears that the package actually uses rpart() to grow its trees, thus inheriting the behaviour described above for competing and surrogate variables.
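    To make this concrete, here is a minimal, self-contained sketch. Everything in it is illustrative: the data are simulated, the names x1, x2, x3 are made up, and it uses ordinary k-fold CV rather than LOOCV for speed. x2 is a near-copy of x1, so it competes for the same splits and shows up in varImp() even when the tree itself only splits on x1; setting maxcompete = 0 and maxsurrogate = 0 in rpart.control() removes that credit, as the quote suggests.

    ```r
    library(caret)
    library(rpart)

    set.seed(1)
    n  <- 200
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.05)   # near-duplicate of x1: a close competitor
    x3 <- rnorm(n)                   # irrelevant noise variable
    y  <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))
    d  <- data.frame(x1, x2, x3, y)

    fit <- train(y ~ ., data = d, method = "rpart",
                 trControl = trainControl(method = "cv"))

    varImp(fit)               # x1 AND x2 both receive high importance...
    fit$finalModel$frame$var  # ...even if only one of them appears in the
                              # tree ("<leaf>" marks terminal nodes)

    # Turn off the tabulation of competing and surrogate splits:
    fit2 <- train(y ~ ., data = d, method = "rpart",
                  trControl = trainControl(method = "cv"),
                  control = rpart.control(maxcompete = 0, maxsurrogate = 0))
    varImp(fit2)              # now only variables actually used in splits
                              # get credit
    ```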