r, machine-learning, random-forest, multilabel-classification, mlr

How to interpret the variable importance plot produced via randomForestSRC::vimp?


This is a question directly related to the answer provided here: MLR random forest multi label get feature importance

To summarize, the question is about producing a variable importance plot for a multi-label classification problem. I am copying the code another person provided to produce the vimp plot:

library(mlr)
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.rfsrc = makeLearner("multilabel.randomForestSRC")
mod2 = train(lrn.rfsrc, yeast.task)

vi = randomForestSRC::vimp(mod2$learner.model)
plot(vi, m.target = "label2")

I am not sure what TRUE, FALSE, and All in the randomForestSRC::vimp plot mean. I read the package documentation and still could not figure it out.

How does that distinction (TRUE, FALSE, All) work?


Solution

  • In that example, you have 14 possible labels. If you look at the data:

    head(yeast)
      label1 label2 label3 label4 label5 label6 label7 label8 label9 label10
    1  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
    2  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE   FALSE
    3  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
    4  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
    5   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
    6  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
    

    Every label, for example label2, has two classes, TRUE and FALSE. In this plot, "all" is the overall error rate, i.e. the proportion of predictions that are wrong across all your samples. TRUE and FALSE show the error rate computed separately for the samples whose actual label is TRUE or FALSE. The plot shows a higher error for TRUE, meaning the model has trouble predicting TRUE correctly.


    We can check this by looking at the out-of-bag (OOB) predicted labels:

    oob_labels = c(TRUE,FALSE)[max.col(vi$classOutput$label2$predicted.oob)]
    table(yeast$label2, oob_labels)
    
           oob_labels
            FALSE TRUE
      FALSE  1175  204
      TRUE    614  424
    

    For the samples whose actual label is TRUE (2nd row), 614/(614+424) = 0.5915222 are predicted wrong. This roughly matches the plot, where the error rate for the TRUE class is ~0.6.
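
    The same arithmetic can be applied to every bar in the error plot. The following is a minimal sketch that assumes nothing beyond the confusion table printed above; the names `conf`, `err_by_class`, and `err_all` are made up for illustration and are not part of randomForestSRC.

    ```r
    # Confusion table copied from the OOB output above
    # (rows = observed label2, columns = OOB-predicted label2).
    conf <- matrix(c(1175, 204,
                      614, 424),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(observed  = c("FALSE", "TRUE"),
                                   predicted = c("FALSE", "TRUE")))

    # Per-class error: share of each observed class that was predicted wrongly.
    err_by_class <- 1 - diag(conf) / rowSums(conf)

    # Overall error ("all"): share of all samples that were predicted wrongly.
    err_all <- 1 - sum(diag(conf)) / sum(conf)

    print(round(c(all = err_all, err_by_class), 3))
    ```

    Under these numbers, the TRUE class has by far the largest error, and the overall error sits between the two class-specific values because it is a weighted average of them.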

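    The 2nd plot, variable importance, follows the same logic: it shows importance overall ("all") and separately for the TRUE and FALSE classes. You can look at each one like this:

    # Assumption: mat is the per-class importance matrix for label2 stored in the vimp object
    mat = vi$classOutput$label2$importance
    par(mfrow = c(1, 3))
    for (i in colnames(mat)) { barplot(mat[, i], horiz = TRUE, las = 2, main = i) }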
    
