Tags: r, machine-learning, svm, confusion-matrix, one-class-classification

Unable to generate a confusion matrix for one-class classification in R


I am trying to understand and implement one-class classification in R on a Kaggle dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

When I try to print a confusion matrix, I get this error:

Error in !all.equal(nrow(data), ncol(data)) : invalid argument type

What am I doing wrong?

library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)

ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv', 
              header = TRUE)

mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",              
             "smoothness_mean","compactness_mean","concavity_mean",         
             "concave.points_mean","symmetry_mean","fractal_dimension_mean", 
             "radius_se","texture_se","perimeter_se",           
             "area_se","smoothness_se","compactness_se",         
             "concavity_se","concave.points_se","symmetry_se",            
             "fractal_dimension_se","radius_worst","texture_worst",          
             "perimeter_worst","area_worst","smoothness_worst",       
             "compactness_worst","concavity_worst","concave.points_worst",   
             "symmetry_worst","fractal_dimension_worst")

#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]

#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]

dataclean <- na.omit(data)

#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]


svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
               type='one-classification',
               trControl = fitControl,
               nu=0.10,
               scale=TRUE,
               kernel="radial",
               metric = "ROC")

#Perform predictions 
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)

confTrain <- table(Predicted=svm.predtrain,
                   Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
                  Reference=test$diagnosis[as.integer(names(svm.predtest))])

confusionMatrix(confTest,positive='TRUE')

print(confTrain)
print(confTest)

Solution

  • Your problem is on this line:

    #Convert classification to logical
    data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]
    

    I'm assuming you are using R 4.0.0 or later, where read.csv no longer converts character columns to factors by default (stringsAsFactors now defaults to FALSE). This command:

    #Convert to numeric
    setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
    

    will then convert all diagnoses to NA, since they are either "M" or "B" representing malignant and benign, respectively.
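
The effect is easy to reproduce in isolation. This is a minimal sketch that uses a made-up three-element vector (an assumption for illustration, not the real dataset) in place of the diagnosis column:

```r
# Hypothetical stand-in for the diagnosis column as read by R >= 4.0.0,
# where it arrives as a character vector
diagnosis_chr <- c("M", "B", "M")

# as.numeric() cannot parse "M" or "B", so every element becomes NA
# (R also warns "NAs introduced by coercion"; silenced here)
diagnosis_num <- suppressWarnings(as.numeric(diagnosis_chr))
print(diagnosis_num)

# Comparing NA with "TRUE" yields NA, so the logical label is NA too
label <- diagnosis_num == "TRUE"
print(label)
```

Since every label ends up NA, a later na.omit() has nothing valid left to keep in that column.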

    So make sure you convert strings to factors when importing the data:

    ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
    str(ds)
    'data.frame':   569 obs. of  33 variables:
     $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 ...
     $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
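
With a factor column, as.numeric() returns the integer level codes rather than NA. The sketch below uses a made-up vector (not the real data) to show which code belongs to which level. It also shows a comparison against the label itself, which avoids having to remember the codes. The variable names are illustrative only.

```r
# Hypothetical stand-in for the diagnosis column read with stringsAsFactors = TRUE
diagnosis_fct <- factor(c("M", "B", "M"))

# Levels are sorted alphabetically, so "B" gets code 1 and "M" gets code 2
print(levels(diagnosis_fct))

# as.numeric() on a factor returns the level codes
codes <- as.numeric(diagnosis_fct)
print(codes)

# Two equivalent ways to flag malignant cases
is_malignant_by_code  <- codes == 2
is_malignant_by_label <- diagnosis_fct == "M"
print(identical(is_malignant_by_code, is_malignant_by_label))
```

Comparing against "M" directly would only work in the question's code if diagnosis were left out of the numeric conversion step, since that step overwrites the column with its codes.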
    

    I guess it will take some people a while to get used to this new behaviour of R. Your command to convert the classification to logical should then be:

    data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # 2 = "M" (malignant); use == 1 to treat "B" (benign) as TRUE
    

    With that change, the rest of your commands work:

    confusionMatrix(confTest, positive='TRUE')
    

    Confusion Matrix and Statistics
    
             Reference
    Predicted FALSE TRUE
        FALSE    10    8  # Note: these numbers vary between runs (random train/test split)
        TRUE    100   50
    
                   Accuracy : 0.3571          
                     95% CI : (0.2848, 0.4346)
        No Information Rate : 0.6548          
        P-Value [Acc > NIR] : 1               
    
                      Kappa : -0.0342         
    
     Mcnemar's Test P-Value : <2e-16          
    
                Sensitivity : 0.86207         
                Specificity : 0.09091         
             Pos Pred Value : 0.33333         
             Neg Pred Value : 0.55556         
                 Prevalence : 0.34524         
             Detection Rate : 0.29762         
       Detection Prevalence : 0.89286         
          Balanced Accuracy : 0.47649         
    
           'Positive' Class : TRUE
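
As for the error message itself: the expression it quotes, !all.equal(nrow(data), ncol(data)), fails whenever the table is not square. When its arguments differ, all.equal() returns a character string rather than FALSE, and ! cannot negate a string. A non-square table is what you get when the reference column contains only one distinct value, which is likely what happened here. The sketch below reproduces this in base R with made-up vectors (an assumption for illustration, caret is not needed) and shows one way to force a square table by fixing the factor levels:

```r
# Hypothetical predictions and a reference column with only one class present
predicted <- c(TRUE, FALSE, TRUE, TRUE)
reference <- c(FALSE, FALSE, FALSE, FALSE)

# table() only creates columns for values that occur, so this is 2 x 1
tab <- table(Predicted = predicted, Reference = reference)
print(dim(tab))

# Same expression as in the error message, evaluated on the non-square table;
# the error message is captured instead of stopping the script
err <- tryCatch(
  !all.equal(nrow(tab), ncol(tab)),
  error = function(e) conditionMessage(e)
)
print(err)

# Fixing the levels on both sides keeps the table 2 x 2 even if a class is absent
lv <- c("FALSE", "TRUE")
tab_sq <- table(
  Predicted = factor(predicted, levels = lv),
  Reference = factor(reference, levels = lv)
)
print(dim(tab_sq))
```

A square table with an empty column makes the check pass, but the statistics are only meaningful once the reference labels really contain both classes, which is what the fix above achieves.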