r, machine-learning, knn

Problem conducting K nearest neighbors using LOOCV


I have an example table that I would like to run kknn classification on. The variable V4 is the response, and I want the classifier to predict whether a new data point will be classified as 0 or 1 (the actual data has 12 columns and the 12th column is the response, but I will simplify the example nonetheless).

library(kknn)

data <- data.frame(
  V1=c(1.2, 2.5, 3.1, 4.8, 5.2), 
  V2=c(0.7, 1.8, 2.3, 3.9, 4.1), 
  V3=c(2.3, 3.7, 1.8, 4.2, 5.5), 
  V4= c(0, 1, 0, 1, 0)
)

Now, I want to build a kknn classification via LOOCV using a for loop. Let's assume k=3.

for (i in 1:nrow(data)) {
  train_data <- data[-i, 1:3]
  train_data_response <- data.frame(data[-i, 4])
  colnames(train_data_response) <- "Response"
  test_set <- data[i, 3]
  model <- kknn(formula=train_data_response ~ ., data.frame(train_data), 
                data.frame(test_set), k=3, scale=TRUE) 
}

Now I get this error that says:

Error in model.frame.default(formula, data = train) : 
  invalid type (list) for variable 'train_data_response'

Is there any way I can solve this error? I thought kknn accepts matrices or data frames. My training and testing data are indeed data frames, so what gives?

Also, am I doing the LOOCV correctly?


Solution

  • We want to leave one row out of train_data at a time, to check whether our results are driven by one specific observation, and we won't touch test_set. Both sets are created even before fitting the kknn without LOOCV,

    > set.seed(42)
    > smp <- sample.int(nrow(data), nrow(data)*.7)
    > train_data <- data[smp, ]
    > test_set <- data[-smp, ]
    > fit <- kknn(formula=as.factor(Response) ~ ., train=train_data, 
    +             test=test_set, k=3, scale=TRUE) 
    

    so we don't need the raw data anymore.
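
    As a quick sanity check of this baseline fit (purely illustrative, based on the split above), we can compare its fitted classes against the held-out responses,

    > table(predicted=fit$fitted.values, actual=test_set$Response)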

    Say we want the result as a matrix loo with nrow(loo) == nrow(test_set) and ncol(loo) == nrow(train_data); we initialize it with

    > loo <- matrix(NA_character_, nrow=nrow(test_set), ncol=nrow(train_data))
    

    and fill it by leaving one observation out of the training data in each kknn fit.

    > for (i in seq_len(nrow(train_data))) {
    +   fit_loo <- kknn(formula=as.factor(Response) ~ ., train=train_data[-i, ], 
    +                   test=test_set, k=3, scale=TRUE) 
    +   loo[, i] <- as.character(fit_loo$fitted.values)
    + }
    

    Note that it is better to wrap the response in as.factor in the formula, which adds safety if it is numeric as in the OP. fit$fitted.values will thus also come back as a factor, which we convert with as.character before storing it in the matrix, to prevent the factor from being coerced to its integer codes.
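
    To see why the as.character step matters, here is a small illustration (separate from the fit) of the integer codes a factor carries internally,

    > f <- as.factor(c(0, 1, 0))
    > as.integer(f)    # the underlying level codes, not the original values
    [1] 1 2 1
    > as.character(f)  # the labels we actually want to keep
    [1] "0" "1" "0"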

    Now we can do many things with the loo result, e.g. look at which left-out observation might influence the model prediction,

    > loo == fit$fitted.values
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]  [,9] [,10] [,11]
    [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE  TRUE  TRUE
    [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE  TRUE  TRUE
    [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE  TRUE  TRUE
    [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE  TRUE  TRUE
    [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE  TRUE  TRUE
    

    which is the ninth row of train_data in this case.
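
    Instead of reading the matrix by eye, we can also pull out those columns programmatically (a small sketch reusing the loo and fit objects from above),

    > influential <- which(!apply(loo == fit$fitted.values, 2, all))
    > train_data[influential, ]   # left-out rows whose removal changed a prediction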

    Or calculate the share of leave-one-out fits in which all classifications agree with the full fit.

    > mean(apply(loo == fit$fitted.values, 2, all))
    [1] 0.9090909
    

    Data:

    Extended a little to have more observations.

    data <- data.frame(
      V1=c(1.2, 2.5, 3.1, 4.8, 5.2), 
      V2=c(0.7, 1.8, 2.3, 3.9, 4.1), 
      V3=c(2.3, 3.7, 1.8, 4.2, 5.5), 
      Response= c(0, 1, 0, 1, 0)
    )
    data <- rbind.data.frame(data, data, data, make.row.names=FALSE)
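
    As an aside, and not required for the approach above, the kknn package also ships train.kknn(), which runs the leave-one-out loop internally over a range of k values; a minimal sketch, assuming the train_data split from above,

    # train.kknn performs LOOCV on train_data for k = 1..kmax
    loo_fit <- train.kknn(as.factor(Response) ~ ., data=train_data, kmax=5)
    loo_fit$best.parameters   # kernel and k with the lowest LOO error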