
Error in calculating confusion matrix or contingency table for multiclass classification using ranger


I am using ranger to model a multiclass classification problem on a large mixed-type data frame (some categorical variables have more than 53 levels). Training and testing run without any problem. However, computing the confusion matrix / contingency table fails.

To illustrate the problem, I am using the iris data instead, with Species as the response variable:

library(ranger)
library(caret)

# Data
idx = sample(nrow(iris),100)
data = iris

# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., ,data=Train_Set,importance="impurity", save.memory = TRUE, probability=TRUE)

# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

The following errors are then encountered:

table(Test_Set$Species, probabilitiesSpecies$predictions)

Error in table(Test_Set$Species, probabilitiesSpecies$predictions) : 
all arguments must have the same length

or

caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.

A binary classification, shown below, works, however:

idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))

Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., ,data=Train_Set,importance="impurity", save.memory = TRUE, probability=TRUE)

# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))

How can I get the confusion matrix for multiclass classification? I have also posted this as a separate thread (Error while computing confusion matrix for multiclassification using ranger).


Solution

  • The ranger documentation says the following about probability = TRUE:

    With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).

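    I.e., with probability = TRUE you get class probability estimates, not class labels, and you still have to turn them into classes yourself (for example with your own threshold values). With probability = FALSE (the default), ranger grows a classification forest and predict() returns the predicted classes as a factor, which is what table() and confusionMatrix() expect.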
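
    Turning probabilities into class labels can be sketched in base R as follows. The probability matrix and the names prob, truth, idx and pred are made up for illustration and do not come from a fitted model; the decision rule (pick the most probable class) is only one possible choice of threshold.

    ```r
    # Hypothetical class probabilities for four test rows; columns are named by class
    prob <- matrix(c(0.9, 0.1, 0.0,
                     0.2, 0.7, 0.1,
                     0.1, 0.2, 0.7,
                     0.0, 0.4, 0.6),
                   ncol = 3, byrow = TRUE,
                   dimnames = list(NULL, c("setosa", "versicolor", "virginica")))

    # Hypothetical true labels for the same four rows
    truth <- factor(c("setosa", "versicolor", "virginica", "versicolor"),
                    levels = colnames(prob))

    # Column index of the largest probability in each row
    idx <- max.col(prob, ties.method = "first")

    # Map the index back to class names and reuse the reference levels,
    # so that predictions and reference are factors with identical levels
    pred <- factor(colnames(prob)[idx], levels = levels(truth))

    tab <- table(Prediction = pred, Reference = truth)
    print(tab)
    ```

    Note that max.col(prob) - 1 yields the integers 0, 1, 2 rather than class names. In the binary example of the question the factor levels happen to be "0" and "1", so the two coincide; with three named classes they do not.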

    In any case, to get a confusion matrix directly, fit the forest with probability = FALSE:

    Species.ranger <- ranger(
            Species ~ .,
            data = Train_Set,
            importance ="impurity",
            save.memory = TRUE, 
            probability = FALSE
    )
    

    The predictions can then be passed to confusionMatrix as follows:

    probabilitiesSpecies <- predict(
            Species.ranger,
            data = Test_Set,
            verbose = TRUE
            )
    
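    # Rows of the table are predictions, columns are the reference classes.
    # Nested call instead of %>%, so no pipe package has to be loaded.
    caret::confusionMatrix(
            table(
                    probabilitiesSpecies$predictions,
                    Test_Set$Species
            )
    )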
    
    

    Output

    Confusion Matrix and Statistics
    
                
                 setosa versicolor virginica
      setosa         17          0         0
      versicolor      0         16         1
      virginica       0          0        16
    
    Overall Statistics
                                              
                   Accuracy : 0.98            
                     95% CI : (0.8935, 0.9995)
        No Information Rate : 0.34            
        P-Value [Acc > NIR] : < 2.2e-16       
                                              
                      Kappa : 0.97            
                                              
     Mcnemar's Test P-Value : NA              
    
    Statistics by Class:
    
                         Class: setosa Class: versicolor Class: virginica
    Sensitivity                   1.00            1.0000           0.9412
    Specificity                   1.00            0.9706           1.0000
    Pos Pred Value                1.00            0.9412           1.0000
    Neg Pred Value                1.00            1.0000           0.9706
    Prevalence                    0.34            0.3200           0.3400
    Detection Rate                0.34            0.3200           0.3200
    Detection Prevalence          0.34            0.3400           0.3200
    Balanced Accuracy             1.00            0.9853           0.9706
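
    If the class probabilities are needed as well, the probability forest from the question can be kept and the labels derived from it afterwards. The following is a minimal sketch, assuming that the columns of $predictions are named after the class levels; set.seed(42) and the names prob, pred_class and cm are illustrative additions, not part of the original post.

    ```r
    library(ranger)
    library(caret)

    set.seed(42)  # assumption: fixed seed only so that the split is repeatable
    idx <- sample(nrow(iris), 100)
    Train_Set <- iris[idx, ]
    Test_Set  <- iris[-idx, ]

    # Probability forest, as in the question
    Species.ranger <- ranger(Species ~ ., data = Train_Set, probability = TRUE)

    # One row per test sample, one column per class
    prob <- predict(Species.ranger, data = Test_Set)$predictions

    # Most probable class per row, looked up by column name (not by position)
    # and rebuilt as a factor with the same levels as the reference
    pred_class <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                         levels = levels(Test_Set$Species))

    cm <- confusionMatrix(data = pred_class, reference = Test_Set$Species)
    print(cm$table)
    ```

    Looking the class up by column name avoids relying on any particular column order in the probability matrix.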