I am trying to understand and implement one-class classification in R on a Kaggle dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
When I try to print a confusion matrix, I get this error:
Error in !all.equal(nrow(data), ncol(data)) : invalid argument type
What am I doing wrong?
library(caret)
library(dplyr)
library(e1071)
library(NLP)
library(tm)
library(data.table)
ds = read.csv('C:/Users/hugos/Desktop/FS Dataset/Health/data_cancer.csv',
header = TRUE)
mycols <- c("id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean",
"smoothness_mean","compactness_mean","concavity_mean",
"concave.points_mean","symmetry_mean","fractal_dimension_mean",
"radius_se","texture_se","perimeter_se",
"area_se","smoothness_se","compactness_se",
"concavity_se","concave.points_se","symmetry_se",
"fractal_dimension_se","radius_worst","texture_worst",
"perimeter_worst","area_worst","smoothness_worst",
"compactness_worst","concavity_worst","concave.points_worst",
"symmetry_worst","fractal_dimension_worst")
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
#Convert classification to logical
data <- ds[,.(id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave.points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave.points_worst,symmetry_worst,fractal_dimension_worst,diagnosis = ds$diagnosis == "TRUE")]
dataclean <- na.omit(data)
#Separating train and test
inTrain<-createDataPartition(1:nrow(dataclean),p=0.7,list=FALSE)
train<- dataclean[inTrain]
test <- dataclean[-inTrain]
svm.model<-svm(diagnosis ~ id+radius_mean+texture_mean+perimeter_mean+area_mean+smoothness_mean+compactness_mean+concavity_mean+concave.points_mean+symmetry_mean+fractal_dimension_mean+radius_se+texture_se+perimeter_se+area_se+smoothness_se+compactness_se+concavity_se+concave.points_se+symmetry_se+fractal_dimension_se+radius_worst+texture_worst+perimeter_worst+area_worst+smoothness_worst+compactness_worst+concavity_worst+concave.points_worst+symmetry_worst+fractal_dimension_worst, data = train,
type='one-classification',
trControl = fitControl,
nu=0.10,
scale=TRUE,
kernel="radial",
metric = "ROC")
#Perform predictions
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)
confTrain <- table(Predicted=svm.predtrain,
Reference=train$diagnosis[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
Reference=test$diagnosis[as.integer(names(svm.predtest))])
confusionMatrix(confTest,positive='TRUE')
print(confTrain)
print(confTest)
Your problem is on this line:
#Convert classification to logical
data <- ds[, .(id, radius_mean, ..., diagnosis = ds$diagnosis == "TRUE")]
I'm assuming you are using R version 4.0 or later, where read.csv no longer converts character columns into factors by default (stringsAsFactors now defaults to FALSE). This command:
#Convert to numeric
setDT(ds)[, (mycols) := lapply(.SD, as.numeric), .SDcols = mycols]
will then convert every diagnosis to NA (with a coercion warning), since the values are the strings "M" and "B", representing malignant and benign, respectively.
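A minimal base R sketch of this behaviour, using a made-up four-element vector rather than the real file (the names diagnosis_chr and diagnosis_fct are only for illustration):

```r
# Character values, as read.csv() delivers them in R >= 4.0.0
diagnosis_chr <- c("M", "B", "B", "M")

# Coercing non-numeric strings yields NA (R also warns
# "NAs introduced by coercion"; suppressed here to keep the output clean)
chr_as_num <- suppressWarnings(as.numeric(diagnosis_chr))
print(chr_as_num)   # NA NA NA NA

# The same values stored as a factor: levels are sorted, so "B" = 1, "M" = 2
diagnosis_fct <- factor(diagnosis_chr)
fct_as_num <- as.numeric(diagnosis_fct)
print(levels(diagnosis_fct))   # "B" "M"
print(fct_as_num)              # 2 1 1 2
```

With a character column the class information is lost entirely, whereas with a factor the integer level codes survive the conversion.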
So, make sure that you are converting strings to factors when importing the data.
ds = read.csv('.../data_cancer.csv', header = TRUE, stringsAsFactors = TRUE)
str(ds)
'data.frame': 569 obs. of 33 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 ...
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
It will probably take people a while to get used to this new behaviour of R. With factors, as.numeric returns the level codes (1 for "B", 2 for "M"), so your command to convert the classification to logical should then be:
data <- ds[, .(id, radius_mean, ..., diagnosis = diagnosis == 2)] # 2 = "M" (malignant); use == 1 to flag "B" instead
With that change, all your remaining commands run.
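As an alternative that is not part of the fix above, you could leave diagnosis out of the numeric conversion and compare it against the label directly, so the result does not depend on the integer level codes or on the stringsAsFactors setting. This is only a sketch in base R on a made-up toy data frame; the values and the name feature_cols are assumptions for illustration:

```r
# Toy stand-in for the real data (values are invented for the example)
ds <- data.frame(
  id          = 1:4,
  diagnosis   = c("M", "B", "B", "M"),
  radius_mean = c("17.99", "13.54", "13.08", "20.57"),
  stringsAsFactors = FALSE
)

# Convert every column except the label to numeric
feature_cols <- setdiff(names(ds), "diagnosis")
ds[feature_cols] <- lapply(ds[feature_cols], as.numeric)

# Compare against the label itself; this gives the same result
# whether diagnosis is a character vector or a factor
ds$diagnosis <- ds$diagnosis == "M"

str(ds)
```

Here TRUE means malignant, which matches the diagnosis == 2 coding used above.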
confusionMatrix(confTest, positive='TRUE')
Confusion Matrix and Statistics

         Reference
Predicted FALSE TRUE
    FALSE    10    8     # Note: these numbers may change between runs
    TRUE    100   50

               Accuracy : 0.3571
                 95% CI : (0.2848, 0.4346)
    No Information Rate : 0.6548
    P-Value [Acc > NIR] : 1

                  Kappa : -0.0342

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.86207
            Specificity : 0.09091
         Pos Pred Value : 0.33333
         Neg Pred Value : 0.55556
             Prevalence : 0.34524
         Detection Rate : 0.29762
   Detection Prevalence : 0.89286
      Balanced Accuracy : 0.47649

       'Positive' Class : TRUE
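As a footnote on the original error message: its text suggests that a check of the form !all.equal(nrow(data), ncol(data)) is applied to the table. The sketch below reproduces that failure in base R only, under the assumption that the reference column ended up with a single class; the vectors are made up and this is not a copy of caret's code:

```r
# Predictions contain both classes, but the reference has only FALSE
predicted <- c(TRUE, FALSE, TRUE, TRUE)
reference <- c(FALSE, FALSE, FALSE, FALSE)

tab <- table(Predicted = predicted, Reference = reference)
print(dim(tab))   # 2 1, i.e. the table is not square

# all.equal() returns a description string, not FALSE, when values differ
check <- all.equal(nrow(tab), ncol(tab))
print(is.character(check))   # TRUE

# Negating a character string is an error, which is what surfaces
err <- tryCatch(!check, error = function(e) conditionMessage(e))
print(err)
```

Once diagnosis really contains both TRUE and FALSE, the table is 2 x 2 and the check passes.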