Tags: r, machine-learning, feature-selection, pls, predictive

Recursive Feature Elimination Error - Error in { : task 1 failed - "wrong sign in 'by' argument"


Thank you so much for your help ahead of time. I'm currently working with a data set that has 794 observations and 1023 variables. I'm attempting to do some sort of feature selection on the data. My initial thought was to do random forest rfe, but the code was taking more than 24 hours to run so I stopped that. My next thought was to use rfe again but with partial least squares since that runs much more quickly than random forest models do. When I did so I got the following error:

Error in { : task 1 failed - "wrong sign in 'by' argument"

I'll present my code below. I understand this error comes from a seq() argument in which there is a negative value of some sort, but my sequence is seq(1, 1021, by = 2), and I don't think there's anything wrong there. I got the error after the code ran for about 6-7 hours. My question, I guess, is twofold:

  1. If you can think of a better feature selection method than what I'm doing, one that I can run in a few hours, I'm all ears.
  2. If you can't think of anything better, do you know how to fix the above error?

Really appreciate all of the help on this. Note: predVars in the code below is a chr[1:1022].

train.control <- trainControl(method = "cv", number = 10)

#Recursive Feature Elimination Partial Least Squares
predVars <- names(Training)[!names(Training) %in% c("MOV")]
varSeq <- seq(1, 1021, by = 2)
ctrl <- rfeControl(method = "cv",
                   number = 10,
                   verbose = FALSE,
                   functions = caretFuncs) 

Results <- rfe(x = Training[,predVars], y = Training$MOV, sizes = varSeq,
               rfeControl = ctrl, method = "pls", tuneLength = 15,
               preProc = c("center","scale"), trControl = train.control)

Solution

  • Your varSeq vector is likely the source of both the error message and the long run time of your computations.

    The ?caret::rfe documentation says the sizes argument is meant to be:

    a numeric vector of integers corresponding to the number of features that should be retained

    As is, your varSeq has 511 integers from 1 to 1021. Having the sequence start with 1 causes the error (I'd guess because sizes = 1 can't be computed). Notice that in the examples in the documentation, the sizes vectors have minimum values of at least 2.
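
    One plausible mechanism, offered as an assumption rather than something verified against the caret source here: if the PLS tuning grid is built with a call along the lines of seq(1, min(ncol(x) - 1, len), by = 1), then a subset with a single predictor makes the upper bound 0, and base R's seq() fails with exactly this message. The helper max_ncomp below is hypothetical and only mirrors that assumed logic; the seq() behaviour itself is plain base R.

    ```r
    # Base R only. A positive `by` with from > to is an error, not an empty sequence.
    msg <- tryCatch(
      seq(1, 0, by = 1),
      error = function(e) conditionMessage(e)
    )
    print(msg)

    # Hypothetical helper mirroring the ASSUMED grid logic: the largest number
    # of PLS components tried for p predictors and a given tuneLength.
    max_ncomp <- function(p, tune_length) min(p - 1, tune_length)

    max_ncomp(1, 15)  # 0: seq(1, 0, by = 1) would then fail
    max_ncomp(2, 15)  # 1: seq(1, 1, by = 1) is fine
    ```

    If that assumption holds, it would also explain why the failure is reported as coming from a worker task: it only happens once the elimination reaches the one-variable subset.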

    Also, working through 500+ sizes with your data simply takes a long time. To avoid this error and speed up the analysis, try something like:

    varSeq <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65)
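
    To put rough numbers on the difference, the sketch below counts the subset sizes in each vector and multiplies by the 10 outer folds. This is only a back-of-the-envelope estimate: it ignores the fit on the full predictor set and the inner resampling that train() does for every one of those fits. The variable names are illustrative.

    ```r
    original_sizes  <- seq(1, 1021, by = 2)                       # from the question
    suggested_sizes <- c(2:25, 30, 35, 40, 45, 50, 55, 60, 65)   # from this answer
    outer_folds     <- 10                                         # rfeControl(number = 10)

    length(original_sizes)   # 511 subset sizes
    length(suggested_sizes)  # 32 subset sizes

    # Approximate number of outer-loop model fits (sizes x folds).
    length(original_sizes) * outer_folds   # 5110
    length(suggested_sizes) * outer_folds  # 320

    # The original vector also contains the problematic size of 1.
    min(original_sizes)   # 1
    min(suggested_sizes)  # 2
    ```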
    

    When I execute your code with some sample data and this sort of adjustment, the analysis runs to completion.

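    library(caret)
    
    # Simulated stand-in for the real data: a two-class outcome and five binary predictors
    set.seed(1)  # arbitrary seed so the sampled predictors are reproducible
    Training <- data.frame(MOV = factor(rep(c("A", "B"), 400)),
                           F1 = sample(0:1, 800, replace = TRUE),
                           F2 = sample(0:1, 800, replace = TRUE),
                           F3 = sample(0:1, 800, replace = TRUE),
                           F4 = sample(0:1, 800, replace = TRUE),
                           F5 = sample(0:1, 800, replace = TRUE))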
    
    
    train.control <- trainControl(method = "cv", number = 10)
    
    #Recursive Feature Elimination Partial Least Squares
    predVars <- names(Training)[!names(Training) %in% c("MOV")]
    varSeq <- seq(2, 5, by = 2)
    ctrl <- rfeControl(method = "cv",
                       number = 10,
                       verbose = FALSE,
                       functions = caretFuncs) 
    
    Results <- rfe(x = Training[,predVars], y = Training$MOV, sizes = varSeq,
                   rfeControl = ctrl, method = "pls", tuneLength = 15,
                   preProc = c("center","scale"), trControl = train.control)
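
    Once rfe() finishes, the selected subset can be read off the returned object. The sketch below assumes the object carries the optsize, optVariables and results components described in ?caret::rfe. The helper summarise_rfe is hypothetical, and mock_fit is a made-up stand-in (its numbers are placeholders, not real output) so that the snippet runs without caret.

    ```r
    # Hypothetical helper: pull the headline results out of an rfe-style object.
    summarise_rfe <- function(fit) {
      list(
        best_size = fit$optsize,       # number of predictors in the chosen subset
        best_vars = fit$optVariables,  # names of those predictors
        by_size   = fit$results        # resampled performance for each size tried
      )
    }

    # Stand-in object with placeholder values, NOT real rfe output.
    mock_fit <- list(
      optsize      = 2,
      optVariables = c("F1", "F4"),
      results      = data.frame(Variables = c(2, 4, 5),
                                Accuracy  = c(0.52, 0.50, 0.49))
    )
    out <- summarise_rfe(mock_fit)
    print(out$best_vars)

    # With the real object from the code above: summarise_rfe(Results)
    ```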