r machine-learning svm confusion-matrix one-class-classification

Cannot generate a confusion matrix for one-class classification in R

I am trying to understand and implement one-class classification in R on the Breast Cancer Wisconsin dataset from Kaggle (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

When I try to print a confusion matrix, I get this error:

Error in !all.equal(nrow(data), ncol(data)) : invalid argument type

What am I doing wrong?

library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)

ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv', 
              header = TRUE)

mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",              
             "smoothness_mean","compactness_mean","concavity_mean",         
             "concave.points_mean","symmetry_mean","fractal_dimension_mean", 
             "radius_se","texture_se","perimeter_se",           
             "area_se","smoothness_se","compactness_se",         
             "concavity_se","concave.points_se","symmetry_se",            
             "fractal_dimension_se","radius_worst","texture_worst",          
             "perimeter_worst","area_worst","smoothness_worst",       
             "compactness_worst","concavity_worst","concave.points_worst",   
             "symmetry_worst","fractal_dimension_worst")

#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]

#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]

dataclean <- na.omit(data)

#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]


svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
               type='one-classification',
               trControl = fitControl,
               nu=0.10,
               scale=TRUE,
               kernel="radial",
               metric = "ROC")

#Perform predictions 
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)

confTrain <- table(Predicted=svm.predtrain,
                   Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
                  Reference=test$diagnosis[as.integer(names(svm.predtest))])

confusionMatrix(confTest,positive='TRUE')

print(confTrain)
print(confTest)

Solution

Your problem is on this line:

#Convert classification to logical
data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]

I'm assuming you are using R 4.0 or later, where read.csv no longer converts character columns into factors by default (stringsAsFactors now defaults to FALSE). This command:

#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]

will then convert every diagnosis to NA (with a coercion warning), since the values are the strings "M" and "B", representing malignant and benign, respectively.
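To see this in isolation, here is a minimal sketch using three made-up rows (the `csv_text` values are invented for illustration, not taken from the Kaggle file). It passes `stringsAsFactors = FALSE` explicitly, which is what R 4.0 and later do by default:

```r
# Hypothetical three-row stand-in for data_cancer.csv
csv_text <- "id,diagnosis,radius_mean
1,M,17.99
2,B,13.54
3,M,20.57"

ds_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
class(ds_chr$diagnosis)   # "character"

# as.numeric() cannot parse "M" or "B", so every value becomes NA
diag_num <- suppressWarnings(as.numeric(ds_chr$diagnosis))
print(diag_num)

# Comparing NA with "TRUE" is still NA, so na.omit() then drops every row
ds_chr$diagnosis <- diag_num == "TRUE"
rows_left <- nrow(na.omit(ds_chr))
print(rows_left)
```

Checking `str(ds)` right after the import is a quick way to catch this before any conversion step runs.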

So, make sure that you are converting strings to factors when importing the data.

ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
str(ds)
'data.frame':   569 obs. of  33 variables:
 $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 ...
 $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
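
The `2 2 2 ...` in that output are the factor's integer codes, which is what `as.numeric` returns for a factor. A small sketch with invented values (the `diagnosis` vector below is hypothetical) shows how the codes map to the levels, plus an alternative that compares against the label and does not depend on the codes at all:

```r
# Hypothetical diagnoses; levels are sorted alphabetically, so "B" comes first
diagnosis <- factor(c("M", "B", "B", "M"))
levels(diagnosis)   # "B" "M"

# as.numeric() on a factor returns the level index, not the label
codes <- as.numeric(diagnosis)
print(codes)        # 2 1 1 2

# Two ways to flag malignant cases; the second does not rely on the codes
is_malignant_by_code  <- codes == 2
is_malignant_by_label <- diagnosis == "M"
print(identical(is_malignant_by_code, is_malignant_by_label))
```

Comparing against the label only works while the column is still a factor or character, that is, before the `as.numeric` step has run.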

I guess it will take some people a while to get used to this new behaviour of R. Your command to convert the classification to logical should then be:

data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # 2 = "M" (malignant); use == 1 to flag "B" (benign) instead

With that change, all your remaining commands work.

confusionMatrix(confTest, positive='TRUE')

Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE    10    8  # Note these numbers may change
    TRUE    100   50

               Accuracy : 0.3571          
                 95% CI : (0.2848, 0.4346)
    No Information Rate : 0.6548          
    P-Value [Acc > NIR] : 1               

                  Kappa : -0.0342         

 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.86207         
            Specificity : 0.09091         
         Pos Pred Value : 0.33333         
         Neg Pred Value : 0.55556         
             Prevalence : 0.34524         
         Detection Rate : 0.29762         
   Detection Prevalence : 0.89286         
      Balanced Accuracy : 0.47649         

       'Positive' Class : TRUE
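
As for why the original message was so cryptic: the error text itself shows that `confusionMatrix` evaluates `!all.equal(nrow(data), ncol(data))` on the table. My understanding is that this is a check that the table is square, and `all.equal` returns a character string rather than `FALSE` when the two numbers differ, which `!` cannot negate. A base R sketch with made-up predictions (no caret needed) reproduces this, assuming the reference column ended up with a single class:

```r
# Hypothetical predictions and references; the reference has only one class
predicted <- c(TRUE, FALSE, TRUE, TRUE)
reference <- c(FALSE, FALSE, FALSE, FALSE)

tab <- table(Predicted = predicted, Reference = reference)
print(dim(tab))   # 2 rows, 1 column: not square

# all.equal() describes the difference as text instead of returning FALSE
check <- all.equal(nrow(tab), ncol(tab))
print(is.character(check))

# Negating a character string is an error, which is what surfaced above
err <- tryCatch(!check, error = function(e) conditionMessage(e))
print(err)

# Giving both vectors the same levels keeps the table square
lv <- c(FALSE, TRUE)
tab_sq <- table(Predicted = factor(predicted, levels = lv),
                Reference = factor(reference, levels = lv))
print(dim(tab_sq))
```

Forcing both vectors onto the same levels is a reasonable safeguard whenever a train/test split might contain only one class.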