[r] [machine-learning] [r-caret]

Center and scale factor predictors within train function from caret in R


I'm having an issue with the preProc argument of the train function from the R caret package. I want to center and scale my predictors but ignore the factor columns. When I call preProcess() outside of train(), it works fine, but I'm hoping to pre-process within the train() function. Am I missing something?

Below is an example where the factor predictor is ignored when using preProcess outside of train.

library(caret)

set.seed(1)  # for reproducibility
df <- data.frame(
    score = runif(1000, 80, 110),
    var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
    var2 = runif(1000, 5, 25)
)
preProcess(df[-1], method = c('center', 'scale'))

Created from 1000 samples and 2 variables

Pre-processing:
  - centered (1)
  - ignored (1)
  - scaled (1)

Here is what happens when I use preProc inside of train:

df <- data.frame(
    score = runif(1000, 80, 110),
    var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
    var2 = runif(1000, 5, 25)
)
mod <- train(score ~ ., data = df,
             method = "lm",
             preProc = c("center", "scale"))
mod$preProcess

Created from 1000 samples and 2 variables

Pre-processing:
  - centered (2)
  - ignored (0)
  - scaled (2)

Solution

  • Your call is dispatched to train.formula, where your data are converted to a matrix by the expression model.matrix(Terms, m, contrasts).

    Since your data are now in matrix form and matrices are atomic, all values are coerced to the same type, in this instance double. This also has the side effect of renaming var1 to var11, which you can see if you inspect the mod$preProcess output (e.g. mod$preProcess$mean). The new name is just model.matrix's dummy-variable naming convention: the factor name with the level appended ("var1" plus level "1" gives "var11"). It is a cosmetic detail, not the cause of your problem.
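
    A quick way to see this yourself, using the same df as above:

    # model.matrix() expands the factor into a numeric dummy column and
    # names it by pasting the factor name and level together
    # ("var1" plus level "1" gives "var11"); the result is all doubles.
    mm <- model.matrix(score ~ ., data = df)
    colnames(mm)
    # [1] "(Intercept)" "var11"       "var2"
    typeof(mm)
    # [1] "double"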

    It appears the class information is captured before this matrix conversion and is ultimately surfaced in the results via the ptype element, but it is not used for anything beyond that:

    sapply(mod$ptype, class)
         var1      var2 
     "factor" "numeric" 
    

    However, the model matrix is what gets passed to train.default, which then goes on to run preProcess(). By the time it reaches that step, the factor information has already been stripped and that column is stored as double. As you observed, preProcess() runs a series of checks and only operates on numeric columns (numeric in the sense of class "integer" or "numeric"). So when preProcess() is called via train(), your values are already doubles, which is why they get centered and scaled.
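
    You can reproduce what train() sees by feeding the model matrix to preProcess() directly (intercept column dropped so only the two predictors remain):

    # With an all-double matrix there is no factor to ignore, so both
    # columns get centered and scaled, matching mod$preProcess above.
    mm <- model.matrix(score ~ ., data = df)[, -1]
    preProcess(mm, method = c("center", "scale"))
    # Pre-processing:
    #   - centered (2)
    #   - ignored (0)
    #   - scaled (2)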

    However, no such matrix conversion happens when you call preProcess() directly, so the factor column is caught and flagged as ignored rather than scaled and centered.

    The documentation for the preProcess argument in ?train specifies:

    Pre-processing code is only designed to work when x is a simple matrix or data frame.

    I think this is what they are getting at: the argument is only designed for "simple" data, meaning all values share one class. If they do not, they will ultimately be coerced to a single type.


    Long story short, I think you ought to either preprocess the data before passing it to train(), or create a recipe and pass that to train().
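
    For the first option, a minimal sketch: fit preProcess() yourself, then use its predict() method to transform the predictors before handing them to train().

    # Center and scale outside of train(); predict() on a preProcess
    # object returns the transformed data with the factor left untouched.
    pp <- preProcess(df[-1], method = c("center", "scale"))
    df_pp <- cbind(df["score"], predict(pp, df[-1]))
    mod <- train(score ~ ., data = df_pp, method = "lm")

    The recipe version looks like this: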

    library(caret)
    library(recipes)  # the package is "recipes", not "recipe"

    rec <- recipe(score ~ ., data = df) |>
      step_center(all_numeric_predictors()) |>
      step_scale(all_numeric_predictors())

    mod_rec <- train(rec, data = df, method = "lm")
    

    If you go the recipe route, note that all_numeric_predictors() selects only numeric predictor columns, so factors such as var1 are excluded from the centering and scaling steps.
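
    If you want to confirm which columns the steps act on, prepping the recipe and printing it lists them (a quick check using the rec object from above):

    # prep() estimates the recipe on df; the printed summary shows the
    # columns each step was trained on, and var1 (the factor) should not
    # appear in the centering or scaling steps.
    prep(rec, training = df)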