I would like to perform k-fold cross-validation in R for a linear regression model and test the one standard error rule:
https://stats.stackexchange.com/questions/17904/one-standard-error-rule-for-variable-selection
Thus, I need a function which gives me back the cross-validation estimate of the prediction error and the standard error of this estimate (or at least the MSE for each fold, so that I can compute the standard error myself). Many packages have functions which compute the cross-validation error (for example, cv.glm
in the boost
package), but usually they return only the CV estimate of the prediction error and not its standard error, or the MSE for each fold.
I tried using package DAAG
, whose function CVlm
should give a richer output than cv.glm
. However, I can't seem to make it work! Here is my code:
a=c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b=c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset=data.frame(x=a,y=b)
CV.list=CVlm(df=dataset,form.lm = formula(y ~ poly(x,2)), m=5)
I get the hardly informative error
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
which doesn't make much sense to me. x
and y
are the same length (11), so clearly the function is complaining about some other x
,y
variables it created internally.
I'd gladly accept solutions with other packages (for example caret
). Also, it would be great if I could specify a number of repetitions for the k-fold cross-validation.
CVlm
doesn't like the poly(x,2)
in your formula. You can easily circumvent that by adding the result of poly(x,2)
first in your datatable and calling CVlm
on these new variables:
dataset2 <- cbind(dataset,poly(dataset$x,2))
names(dataset2)[3:4] <- c("p1","p2")
CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2))
And as you are interested by the values printed that are unfortunately not saved anywhere you can use something like:
# captures the printed output
printOut <- capture.output(CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2)))
# function to parse the output
# to be adapted if necessary for your needs
GetValues <- function(itemName,printOut){
line <- printOut[grep(itemName,printOut)]
items <- unlist(strsplit(line,"[=]| +"))
itemsMat <- matrix(items,ncol=2,byrow=TRUE)
vectVals <- as.numeric(itemsMat[grep(itemName,itemsMat[,1]),2])
return(vectVals)
}
# get the Mean square values as a vector
MS <- GetValues("Mean square",printOut)