
How does R's iml package handle syntactically invalid factor levels?


I'm using the iml package to derive ALE values from a caret-trained rf model. In classification tasks where the levels of the dependent variable are not syntactically valid R names, this can cause problems because, under the hood, these levels end up as column names during prediction.

Here is a silly example in which the last line of code throws an "undefined columns selected" error:

# ----- Packages -----
library(randomForest)
library(caret)
library(iml)

# ----- Dummy Data -----
One <- as.factor(sample(c("1", "0"), size = 250, replace = TRUE))
Two <- as.factor(sample(make.names(c("1", "0")), size = 250, replace = TRUE))
Three <- as.factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))
Four <- as.factor(sample(make.names(c("A-1_x", "B-0_y", "1 C-$_3.5")), size = 250, replace = TRUE))
df <- cbind.data.frame(One, Two, Three, Four)

# ----- Modelling + IML for syntactically invalid levels from "Three" -----
ALE.ClassOfInterest <- "1 C-$_3.5"
TrainData <- cbind.data.frame(One, Two, Four)
rf <- caret::train(TrainData, Three, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE3 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results

In some cases a very simple modification did the trick: calling make.names in the second-to-last line of code, like so:

Pred <- Predictor$new(rf, data=df, class=make.names(ALE.ClassOfInterest))

However, in the example above this does not help, and the only solution I have found is to call make.names at the very beginning, turning all levels into syntactically valid strings before even training the model (see column "Four"). I'd like to stick to the original strings for various reasons, though, and I have noticed that other equally invalid levels such as "0" and "1" (see column "One") don't require any workaround. This works:

# ----- Modelling + IML for syntactically invalid levels from "One" -----
ALE.ClassOfInterest <- "1"
TrainData <- cbind.data.frame(Two, Three, Four)
rf <- caret::train(TrainData, One, method = "rf", tuneLength = 3, trControl = trainControl(method = "cv"))
Pred <- Predictor$new(rf, data=df, class=ALE.ClassOfInterest)
FE1 <- FeatureEffects$new(Pred, features=names(df), method="ale")$results
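As a quick check on which levels are actually affected, here is a minimal base-R sketch that lists what make.names does to each level and how data.frame treats such strings when they become column names. The matrix m is a hypothetical stand-in for a table of class probabilities; whether caret or iml builds its prediction table this way is an assumption on my part, not something I have confirmed.

```r
# Levels used in the example columns "One" and "Three"
lv_one   <- c("1", "0")
lv_three <- c("A-1_x", "B-0_y", "1 C-$_3.5")
lv_all   <- c(lv_one, lv_three)

# Side-by-side view: original level, make.names() result, and whether it changed
level_check <- data.frame(
  original   = lv_all,
  made_valid = make.names(lv_all),
  changed    = lv_all != make.names(lv_all),
  stringsAsFactors = FALSE
)
print(level_check)

# Hypothetical stand-in for a class-probability matrix with the levels as column names
m <- matrix(0, nrow = 1, ncol = length(lv_three), dimnames = list(NULL, lv_three))

# Default check.names = TRUE passes the column names through make.names()
names(data.frame(m))

# check.names = FALSE keeps the original strings as column names
names(data.frame(m, check.names = FALSE))
```

Every level in both columns is changed by make.names, so syntactic validity alone does not explain why "One" works and "Three" does not.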

Does anyone know what is happening under the hood, if it is not a plain make.names, or can anyone suggest a solution that lets me stick to the original factor levels in the model?

Thanks, Mark


Solution

  • This appears to be a feature/bug that has already been reported to the package author in issue iml/195. I'm not optimistic about a quick fix, since that issue was reported in July 2022 (20 months ago as of writing this answer) and has drawn no commentary from the author. (The last change to the package's R functions was in April 2022; it does not appear to get many updates.)
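Until that is resolved, one possible stopgap is to train on syntactically valid levels while keeping a lookup table, so the original strings can be restored wherever results are reported. The sketch below uses only base R; level_map and restore_labels are hypothetical names introduced for illustration, and the caret/iml calls are left as comments because I have not verified this end to end against the issue.

```r
set.seed(1)
Three <- factor(sample(c("A-1_x", "B-0_y", "1 C-$_3.5"), size = 250, replace = TRUE))

# Lookup table: names are the valid labels, values are the original labels
level_map <- setNames(levels(Three), make.names(levels(Three), unique = TRUE))

# Same observations, recoded with syntactically valid levels for training
Three_valid <- factor(Three, levels = levels(Three), labels = names(level_map))

# Valid counterpart of the class of interest
ALE.ClassOfInterest <- "1 C-$_3.5"
class_valid <- names(level_map)[level_map == ALE.ClassOfInterest]

# Assumed usage (not run here): train on Three_valid, then pass class_valid
# rf   <- caret::train(TrainData, Three_valid, method = "rf")
# Pred <- Predictor$new(rf, data = df, class = class_valid)

# Hypothetical helper: translate valid labels back to the original strings
restore_labels <- function(x, map) unname(map[as.character(x)])

head(restore_labels(Three_valid, level_map))
```

This does not keep the original levels inside the model itself; it only hides the renaming at the boundaries, which may or may not be acceptable depending on why the original strings are needed.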