r svm e1071

Error in training SVM model: Error: One or more factor levels in the outcome has no data: '2'


I have the following data set (a sample of the first 10 rows is given):

structure(list(variableA = c(11L, 7L, 17L, 7L, 7L, 2L, 
2L, 7L, 7L, 4L), variableB = c(10L, 20L, 4L, 0L, 0L, 1L, 
1L, 0L, 0L, 2L), variableC = c(284L, 
43L, 19L, 0L, 0L, 27L, 27L, 0L, 0L, 20L), variableD = c(299L, 
24L, 28L, 167L, 167L, 27L, 27L, 194L, 194L, 21L), variableE = c(2, 
1, 1, 1, 1, 1, 1, 1, 1, 1), variableF1 = c(0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), variableF2 = c(0L, 
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), variableF3 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF4 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF5 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF6 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF7 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF8 = c(0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF9 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF10 = c(0L, 
0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), variableG1 = c(1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableG2 = c(0L, 
0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L), variableG3 = c(0L, 
1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), clusters = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 1L, 6L, 6L), .Label = c("1", "2", "3", 
"4", "5", "6"), class = "factor"), out = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 6L, 6L), .Label = c("3", "1", "2", "4", 
"5", "6"), class = "factor")), row.names = c(1L, 3L, 4L, 5L, 
6L, 8L, 9L, 12L, 13L, 14L), class = "data.frame")

I have been trying to use the support vector machine algorithm on this data set. It was working well earlier, but now for some reason it's giving the error.

The model I am trying is:

library(caret)  # provides trainControl() and train()

set.seed(111)
# 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Linear SVM on the listed predictors; `train` is the training data frame
svm_Linear <- train(out ~ variableA + variableB + variableC + variableD +
                      variableE + variableF1 + variableF2 + variableF3 +
                      variableF4 + variableF5 + variableF6 + variableF7 +
                      variableF8 + variableF9 + variableF10 + variableG1 +
                      variableG2 + variableG3,
                    data = train, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
svm_Linear

But I am getting this error, which I am not able to understand:

Error: One or more factor levels in the outcome has no data: '2'

I saw similar posts on this site, but none had the answer I needed.


Solution

  • Your out column is a factor with 6 levels, but only 3 of them are represented in the dput you provided in your post. train() stops when an outcome level has no observations, which is why you're getting this error.

    levels(train$out)
    # "3" "1" "2" "4" "5" "6"
    
    unique(train$out)
    # 3 1 6
    # Levels: 3 1 2 4 5 6
    

    This is probably due to the way you performed your train/test split.

    You can drop the unused levels so that out keeps only "3", "1" and "6" (for example with droplevels()), but this will be a problem if your test data contains the other response levels.
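
    As a minimal sketch of that first option, base R's droplevels() removes factor levels that have no rows. The vector below only reproduces the out column from the 10-row sample in the question; with the real data you would presumably apply the same call to train$out.

    ```r
    # Rebuild the `out` column from the dput sample in the question
    out <- factor(c("3", "3", "3", "3", "3", "3", "3", "1", "6", "6"),
                  levels = c("3", "1", "2", "4", "5", "6"))

    # Counts per level; levels without any rows show up with a count of 0
    table(out)

    # Keep only the levels that actually occur in the data
    out_dropped <- droplevels(out)
    levels(out_dropped)
    ```

    Note that a model fitted on the reduced factor cannot predict the dropped classes, which is the caveat mentioned above.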

    Consider using a stratified sampling approach instead, to ensure your response variable is correctly represented across a train/test split. Questions about stratified sampling would be more appropriate for Cross Validated than for Stack Overflow, but there are some good starting points mentioned in this SO post and this one.
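
    A rough illustration of a stratified split in base R, sampling the same fraction of rows from each class. The toy outcome y, the 80% fraction and the object names are assumptions made for this sketch, not taken from your data.

    ```r
    set.seed(111)

    # Hypothetical toy outcome with three classes of different sizes
    y <- factor(rep(c("1", "3", "6"), times = c(10, 40, 20)))

    # Sample 80% of the row indices within each class separately,
    # so that every class contributes rows to the training part
    stratified_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
      idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
    }), use.names = FALSE)

    train_y <- y[stratified_idx]
    test_y  <- y[-stratified_idx]

    table(train_y)   # class counts in the training part
    table(test_y)    # class counts in the held-out part
    ```

    Since you are already using caret, its createDataPartition() function does a similar class-wise split on a factor outcome and may be the more convenient choice. Very rare classes can still end up missing from individual cross-validation folds, so it is worth checking the class counts first.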