I am using ranger to model a multi-class classification problem on a large mixed-type data frame (where some categorical variables have more than 53 levels). Training and testing run without any problem. However, producing the confusion matrix / contingency table fails.
I am using the iris data to illustrate the difficulties I am facing, treating Species as the classification variable:
library(ranger)
library(caret)
# Data
idx = sample(nrow(iris),100)
data = iris
# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)
The following difficulties are encountered:
table(Test_Set$Species, probabilitiesSpecies$predictions)
Error in table(Test_Set$Species, probabilitiesSpecies$predictions) :
all arguments must have the same length
or
caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.
A binary classification, shown below, works however:
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
# Train
Species.ranger <- ranger(Species ~ ., data = Train_Set, importance = "impurity", save.memory = TRUE, probability = TRUE)
# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))
How can this issue be tackled for multi-class classification to get the confusion matrix? I have posted this as a separate thread too (Error while computing confusion matrix for multiclassification using ranger)
The ranger documentation says the following about probability = TRUE:
With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).
I.e., when set to TRUE, you get probability estimates, which you can then classify according to your own threshold values. I do not know the default decision rule when it is set to FALSE, however.
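If you want to keep probability = TRUE, the "classify it yourself" step can be sketched as below. This is a minimal sketch, not the only way to do it: the names prob.forest, prob.matrix and predicted.class and the seed are my own choices, and I assume the simplest decision rule (pick the class with the highest estimated probability). It also shows why the calls in the question fail: the predictions are a 50 x 3 matrix of probabilities rather than a vector of 50 labels, and max.col(...) - 1 yields the integers 0, 1, 2 rather than the factor levels of Species.

```r
library(ranger)
library(caret)

set.seed(42)  # assumption: fixed seed so the split is reproducible
idx <- sample(nrow(iris), 100)
Train_Set <- iris[idx, ]
Test_Set <- iris[-idx, ]

# Probability forest: predictions are class probabilities, not labels
prob.forest <- ranger(Species ~ ., data = Train_Set, probability = TRUE)
prob.matrix <- predict(prob.forest, data = Test_Set)$predictions

# One row per test sample, one column per class, so table() sees
# 150 values against 50 labels and complains about the length
dim(prob.matrix)

# Decision rule (assumed here): take the most probable class per row.
# Looking the label up via colnames avoids relying on column order.
predicted.class <- factor(
  colnames(prob.matrix)[max.col(prob.matrix, ties.method = "first")],
  levels = levels(Test_Set$Species)
)

# Both arguments are now factors with identical levels
cm <- confusionMatrix(data = predicted.class, reference = Test_Set$Species)
cm
```

Setting the levels explicitly matters: if one class is never predicted, the factor would otherwise lack that level and confusionMatrix would raise the same "same levels" error again.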
In any case, your approach should be the following,
Species.ranger <- ranger(
Species ~ .,
data = Train_Set,
importance ="impurity",
save.memory = TRUE,
probability = FALSE
)
This can then be evaluated with confusionMatrix in the following way:
probabilitiesSpecies <- predict(
Species.ranger,
data = Test_Set,
verbose = TRUE
)
# Nested call instead of %>%, which is only available after loading magrittr or dplyr
confusionMatrix(
table(
probabilitiesSpecies$predictions,
Test_Set$Species
)
)
Output
Confusion Matrix and Statistics

             setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         16         1
  virginica       0          0        16

Overall Statistics

               Accuracy : 0.98
                 95% CI : (0.8935, 0.9995)
    No Information Rate : 0.34
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.97

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            1.0000           0.9412
Specificity                   1.00            0.9706           1.0000
Pos Pred Value                1.00            0.9412           1.0000
Neg Pred Value                1.00            1.0000           0.9706
Prevalence                    0.34            0.3200           0.3400
Detection Rate                0.34            0.3200           0.3200
Detection Prevalence          0.34            0.3400           0.3200
Balanced Accuracy             1.00            0.9853           0.9706
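If you need the numbers rather than the printed report, the object returned by confusionMatrix can be indexed directly. A small sketch, with fit, pred and cm as illustrative names of my own and an arbitrary seed, so the exact values will differ from the output above:

```r
library(ranger)
library(caret)

set.seed(1)  # assumption: arbitrary seed, results vary with the split
idx <- sample(nrow(iris), 100)

# Classification forest (probability = FALSE is the default)
fit <- ranger(Species ~ ., data = iris[idx, ])
pred <- predict(fit, data = iris[-idx, ])$predictions

cm <- confusionMatrix(data = pred, reference = iris[-idx, "Species"])

cm$table                      # the contingency table itself
cm$overall[["Accuracy"]]      # a single overall statistic
cm$byClass[, "Sensitivity"]   # one value per class (matrix for > 2 classes)
```

With only two classes, cm$byClass is a named vector rather than a matrix, so the last line would become cm$byClass[["Sensitivity"]].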