This is a question directly related to the answer provided here: MLR random forest multi label get feature importance
To summarize, the question is about producing a variable importance plot for a multi-label classification problem. I am copying the code provided by another person to produce the vimp plot:
library(mlr)
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.rfsrc = makeLearner("multilabel.randomForestSRC")
mod2 = train(lrn.rfsrc, yeast.task)
vi = randomForestSRC::vimp(mod2$learner.model)
plot(vi, m.target = "label2")
I am not sure what TRUE, FALSE, and All in the randomForestSRC::vimp plot mean. I read the package documentation and still could not figure it out.
How does that distinction (TRUE, FALSE, All) work?
In that example, you have 14 possible labels. If you look at the data:
head(yeast)
  label1 label2 label3 label4 label5 label6 label7 label8 label9 label10
1  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
2  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE   FALSE
3  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
4  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
5   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
6  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   FALSE
For every label, for example label2, there are two classes, TRUE and FALSE. In this plot, "all" is the overall error rate, that is, the proportion of predictions that are wrong across all your samples. The TRUE and FALSE curves are the error rates computed separately for the samples whose true label is TRUE or FALSE. So from this plot, you can see that the error for TRUE is higher, meaning the model has problems predicting TRUE correctly.
We can check this by looking at the out-of-bag (OOB) predicted labels:
oob_labels = c(TRUE,FALSE)[max.col(vi$classOutput$label2$predicted.oob)]
table(yeast$label2, oob_labels)
       oob_labels
        FALSE TRUE
  FALSE  1175  204
  TRUE    614  424
You can see that for the TRUE labels (2nd row), 614/(614+424) = 0.5915222 of the predictions are wrong. This is roughly what you see in the plot, where the error rate for the TRUE label is ~0.6.
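To make the three quantities concrete, here is a rough sketch that recomputes them from the confusion table above. It is not part of the original code: the names conf, class_err and all_err are made up for illustration, and the counts are simply typed in from the table, so it runs in base R without mlr or randomForestSRC.

```r
# Confusion counts copied from the table above
# (rows = true label2, columns = OOB-predicted label2).
conf <- matrix(c(1175, 204,
                  614, 424),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))

# Per-class error: share of each true class that was predicted wrongly.
class_err <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all samples that were predicted wrongly.
all_err <- 1 - sum(diag(conf)) / sum(conf)

# Print the three rates side by side, in the order all, FALSE, TRUE.
print(round(c(all = all_err, class_err), 3))
```

This gives roughly 0.34 for all, 0.15 for FALSE and 0.59 for TRUE, which should be close to where the three curves end in the error rate plot, although the plotted values may differ slightly.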
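As for the second plot, the variable importance plot, it follows the same logic: importance is reported overall (all) and separately for the TRUE and FALSE classes. You can look at each one like this:
mat = vi$classOutput$label2$importance  # assumed definition of mat, which the snippet does not show: the importance matrix for label2
par(mfrow = c(1, 3))
for (i in colnames(mat)) { barplot(mat[, i], horiz = TRUE, las = 2, main = i) }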