I'm having an issue with the preProc argument for the train function using the R caret package. I want to center and scale my predictors but ignore the factor columns. When I preProcess outside of train, it works fine but I'm hoping to pre-process within the train function. Am I missing something?
Below is an example where the factor predictor is ignored when using preProcess outside of train.
df <- data.frame(
  score = runif(1000, 80, 110),
  var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2 = runif(1000, 5, 25)
)
preProcess(df[-1], method=c('center','scale'))
Created from 1000 samples and 2 variables
Pre-processing:
- centered (1)
- ignored (1)
- scaled (1)
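Applying the fitted preProcess object with predict() confirms the factor column passes through untouched (same simulated data as above):

```r
library(caret)

set.seed(1)
df <- data.frame(
  score = runif(1000, 80, 110),
  var1  = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2  = runif(1000, 5, 25)
)

pp <- preProcess(df[-1], method = c("center", "scale"))

# predict() applies the transformation: var2 is centered and scaled,
# var1 comes back as the original, untouched factor
df_pp <- predict(pp, df[-1])
```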
Here is what happens when I use preProc inside of train:
df <- data.frame(
  score = runif(1000, 80, 110),
  var1 = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2 = runif(1000, 5, 25)
)
mod <- train(score ~ ., data = df,
             method = "lm",
             preProc = c("center", "scale"))
mod$preProcess
Created from 1000 samples and 2 variables
Pre-processing:
- centered (2)
- ignored (0)
- scaled (2)
Your call is dispatched to train.formula, where your data are converted to a matrix by the expression model.matrix(Terms, m, contrasts). Since your data are now in matrix form and matrices are atomic, all values are coerced to a single type, in this instance double. This also has the side effect of renaming var1 to var11, which you can see if you inspect the mod$preProcess output (e.g. mod$preProcess$mean). That renaming comes from the dummy coding: model.matrix() appends the level label ("1") to the factor's name. It is not related to your question, but it confirms the factor was expanded into a numeric column.
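You can reproduce both effects, the coercion to double and the var11 renaming, by calling model.matrix() on the same formula yourself (base R only, using the df from the question):

```r
set.seed(1)
df <- data.frame(
  score = runif(1000, 80, 110),
  var1  = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2  = runif(1000, 5, 25)
)

# train.formula builds a model matrix much like this one
mm <- model.matrix(score ~ ., data = df)

colnames(mm)  # "(Intercept)" "var11" "var2" -- level "1" appended to var1
typeof(mm)    # "double" -- the factor is now a 0/1 numeric column
```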
It appears the class information is captured before this matrix conversion and is ultimately carried through to the results via the ptype element, but it does nothing beyond being stored there:
sapply(mod$ptype, class)
     var1      var2
 "factor" "numeric"
However, the model matrix is what gets passed to train.default, which then goes on to run preProcess(). By the time it reaches that step, the factor information has already been stripped and that variable is of class double. As you noted, preProcess() does a series of checks and only operates on numeric data (numeric in the sense of class "integer", "numeric", or "double"). So when preProcess() is called via train(), your values are already double, which is why they get centered and scaled.
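You can imitate what happens inside train() by handing preProcess() the model matrix directly; since every column is now double, nothing is ignored (a sketch using the df from the question):

```r
library(caret)

set.seed(1)
df <- data.frame(
  score = runif(1000, 80, 110),
  var1  = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2  = runif(1000, 5, 25)
)

# Drop the intercept column (train() similarly removes it before pre-processing)
X <- model.matrix(score ~ ., data = df)[, -1, drop = FALSE]

pp <- preProcess(X, method = c("center", "scale"))
names(pp$mean)  # "var11" "var2" -- both columns get centered and scaled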
However, the same conversion to a matrix is not made when you call preProcess() directly, so the factor class is caught and that column is removed before scaling and centering.
The preProcess argument documentation in ?train specifies:
Pre-processing code is only designed to work when x is a simple matrix or data frame.
I think this is what they are getting at: the argument is only meant for "simple" data, meaning all values are of the same class. If they are not, they will ultimately be coerced.
Long story short, I think you ought to either pass the pre-processed data to train() or create a recipe and pass that to train(), like so:
library(recipes)
recipe(score ~ ., data = df) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  train(data = df, method = "lm")
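The first option, pre-processing before the train() call, would look like this (a sketch: the preProcess object is applied with predict() and the transformed frame is handed to train() with no preProc argument):

```r
library(caret)

set.seed(1)
df <- data.frame(
  score = runif(1000, 80, 110),
  var1  = as.factor(sample(0:1, 1000, replace = TRUE)),
  var2  = runif(1000, 5, 25)
)

# Center/scale outside train(): the factor column is ignored here
pp    <- preProcess(df[-1], method = c("center", "scale"))
df_pp <- cbind(score = df$score, predict(pp, df[-1]))

# No preProc argument needed; the predictors are already transformed
mod <- train(score ~ ., data = df_pp, method = "lm")
```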
If you go the recipes route, note that all_numeric_predictors() selects only numeric columns with the predictor role, so factor columns such as var1 are excluded from the centering and scaling steps.