Thank you so much for your help ahead of time. I'm currently working with a data set that has 794 observations and 1023 variables. I'm attempting to do some sort of feature selection on the data. My initial thought was to do random forest rfe, but the code was taking more than 24 hours to run so I stopped that. My next thought was to use rfe again but with partial least squares since that runs much more quickly than random forest models do. When I did so I got the following error:
"Error in { : task 1 failed - "wrong sign in 'by' argument".
I'll present my code below. I understand this error comes from a seq() argument in which there is a negative value of some sort, but my sequence is seq(1, 1021, by = 2), and I don't think there's anything wrong there. I got the error after the code had run for about 6-7 hours. My question I guess is twofold:
train.control <- trainControl(method = "cv", number = 10)
#Recursive Feature Elimination Partial Least Squares
predVars <- names(Training)[!names(Training) %in% c("MOV")]
varSeq <- seq(1, 1021, by = 2)
ctrl <- rfeControl(method = "cv",
                   number = 10,
                   verbose = FALSE,
                   functions = caretFuncs)
Results <- rfe(x = Training[, predVars], y = Training$MOV, sizes = varSeq,
               rfeControl = ctrl, method = "pls", tuneLength = 15,
               preProc = c("center", "scale"), trControl = train.control)
Your varSeq vector is likely the source of the error message, and also of the long duration of your computations.
The ?caret::rfe documentation says the sizes argument is meant to be: "a numeric vector of integers corresponding to the number of features that should be retained".
As is, your varSeq has 500+ integers from 1 to 1021. Having the sequence start with 1 causes the error (I'd guess because sizes = 1 can't be computed). Notice that in the examples in the documentation the sizes vectors have minimum values that are at least 2.
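To show where a message like this can come from, here is a minimal base R sketch. The seq() behaviour is standard R. The link to caret is only an assumption on my part: if the tuning grid for a one-predictor subset ends up with an upper bound of 0, a call of this shape would fail in the same way. The names n_predictors and upper are hypothetical and used purely for illustration.

```r
# Base R only: an increasing step cannot count down from 1 to 0,
# so seq() stops with "wrong sign in 'by' argument".
msg <- tryCatch(
  {
    seq(1, 0, by = 1)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical illustration (an assumption, not taken from the caret source):
# a grid capped at (number of predictors - 1) has an upper bound of 0
# when only one predictor is retained.
n_predictors <- 1
upper <- min(n_predictors - 1, 15)
print(upper)
```

If that is what happens internally, any sizes value of 1 would trigger the error no matter how the rest of the sequence looks.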
Also, having 500+ 'sizes' to go through with your data just takes a long time. So, to avoid this error and speed up the analysis, try something like:
varSeq <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65)
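As a rough sense of scale, the sketch below compares the two sizes vectors. The fit count is a back-of-the-envelope assumption (one inner 10-fold cross-validation per subset size per outer fold, ignoring the tuning grid and the full-predictor model), not a measurement; rough_fits is a hypothetical helper.

```r
# Sizes from the question versus the coarser grid suggested above.
original_sizes <- seq(1, 1021, by = 2)
coarse_sizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65)

n_original <- length(original_sizes)  # 511 subset sizes
n_coarse <- length(coarse_sizes)      # 32 subset sizes

# Hypothetical helper: subset sizes x outer folds x inner folds.
# This is an assumed approximation of the workload, not an exact count.
rough_fits <- function(sizes, outer_folds = 10, inner_folds = 10) {
  length(sizes) * outer_folds * inner_folds
}

print(rough_fits(original_sizes))  # 51100
print(rough_fits(coarse_sizes))    # 3200
```

Under those assumptions the coarser grid needs roughly one sixteenth of the model fits, which is where most of the speed-up would come from.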
When I execute your code with some sample data and this sort of adjustment, the analysis runs to completion.
library(caret)
set.seed(1)  # make the simulated data reproducible
# Simulated data: a two-level outcome and five random binary predictors
Training <- data.frame(MOV = factor(rep(c("A", "B"), 400)),
                       F1 = sample(0:1, 800, replace = TRUE),
                       F2 = sample(0:1, 800, replace = TRUE),
                       F3 = sample(0:1, 800, replace = TRUE),
                       F4 = sample(0:1, 800, replace = TRUE),
                       F5 = sample(0:1, 800, replace = TRUE))
train.control <- trainControl(method = "cv", number = 10)
#Recursive Feature Elimination Partial Least Squares
predVars <- names(Training)[!names(Training) %in% c("MOV")]
varSeq <- seq(2, 5, by = 2)
ctrl <- rfeControl(method = "cv",
                   number = 10,
                   verbose = FALSE,
                   functions = caretFuncs)
Results <- rfe(x = Training[, predVars], y = Training$MOV, sizes = varSeq,
               rfeControl = ctrl, method = "pls", tuneLength = 15,
               preProc = c("center", "scale"), trControl = train.control)