I have an example table on which I would like to run kknn classification. The variable V4
is the response, and I want the classifier to predict whether a new data point will be classified as 0
or 1.
(The actual data has 12 columns, with the 12th column as the response, but I will simplify the example here.)
library(kknn)
data <- data.frame(
V1=c(1.2, 2.5, 3.1, 4.8, 5.2),
V2=c(0.7, 1.8, 2.3, 3.9, 4.1),
V3=c(2.3, 3.7, 1.8, 4.2, 5.5),
V4= c(0, 1, 0, 1, 0)
)
Now, I want to build a kknn
classification via LOOCV using a for
loop. Let's assume k=3.
for (i in 1:nrow(data)) {
  train_data <- data[-i, 1:3]
  train_data_response <- data.frame(data[-i, 4])
  colnames(train_data_response) <- "Response"
  test_set <- data[i, 1:3]
  model <- kknn(formula=train_data_response ~ ., data.frame(train_data),
                data.frame(test_set), k=3, scale=TRUE)
}
Now I get this error:
Error in model.frame.default(formula, data = train) :
  invalid type (list) for variable 'train_data_response'
How can I solve this error? I thought kknn
accepts matrices or data frames, and my training and testing data are indeed data frames, so what gives?
Also, am I doing the LOOCV correctly?
The error occurs because the formula references a separate data.frame object (train_data_response) rather than a column of the training data; the formula interface can only resolve variable names found in train. So keep the response as a column of train_data and refer to it by name. As for the LOOCV: we leave one row out of train_data at a time to check that the results are not driven by one specific row, and we never touch test_set. Both sets are created before any kknn call, just as in a fit
without LOOCV,
> set.seed(42)
> smp <- sample.int(nrow(data), nrow(data)*.7)
> train_data <- data[smp, ]
> test_set <- data[-smp, ]
> fit <- kknn(formula=as.factor(Response) ~ ., train=train_data,
+ test=test_set, k=3, scale=TRUE)
so we don't need the raw data anymore.
Say we want the result as a matrix loo
with nrow(loo) == nrow(test_set)
and ncol(loo) == nrow(train_data)
; we initialize it with
> loo <- matrix(NA_character_, nrow=nrow(test_set), ncol=nrow(train_data))
and fill it, leaving one row out of train_data in each kknn
fit.
> for (i in seq_len(nrow(train_data))) {
+ fit_loo <- kknn(formula=as.factor(Response) ~ ., train=train_data[-i, ],
+ test=test_set, k=3, scale=TRUE)
+ loo[, i] <- as.character(fit_loo$fitted.values)
+ }
Note that it is better to cast the response as.factor
in the formula, which adds safety when it is numeric as in the OP's data. The fit_loo$fitted.values
will, thus, also come back as a factor, which we store in the matrix as.character
to prevent the factor levels from being coerced to integers.
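To see why as.character matters here, a minimal standalone sketch of the coercion pitfall (not tied to the fit above):

```r
f <- factor(c(0, 1, 0))  # a 0/1 response stored as a factor
as.integer(f)            # 1 2 1 -- the underlying level codes, not the labels
as.character(f)          # "0" "1" "0" -- the labels we actually want
```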
Now we can do many things with the loo
result, e.g. look at which left-out observation might influence the model's predictions,
> loo == fit$fitted.values
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
which is the ninth row of train_data in this case.
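Rather than reading the matrix by eye, the influential left-out rows can be located programmatically; this assumes the loo and fit objects from the session above:

```r
# Columns (i.e. left-out rows of train_data) where at least one
# prediction differs from the full model's fit
which(!apply(loo == fit$fitted.values, 2, all))
```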
Or calculate the proportion of leave-one-out fits whose
classifications all agree with the full model's predictions.
> mean(apply(loo == fit$fitted.values, 2, all))
[1] 0.9090909
Data:
Extended a little to have more observations.
data <- data.frame(
V1=c(1.2, 2.5, 3.1, 4.8, 5.2),
V2=c(0.7, 1.8, 2.3, 3.9, 4.1),
V3=c(2.3, 3.7, 1.8, 4.2, 5.5),
Response= c(0, 1, 0, 1, 0)
)
data <- rbind.data.frame(data, data, data, make.row.names=FALSE)