I am using ranger to fit a random forest. As the evaluation metric, I am using the ROC AUC score, computed with cvAUC. After making predictions, when I try to evaluate the AUC score, I get an error: Format of predictions is invalid. It couldn't be coerced to a list. I think this is because the predictions include a Levels part that shows the unique levels of the predictions. However, I could not get rid of that part. A minimal reproducible example that throws the error is below:
library(caret)
# install.packages("cvAUC")  # run once if cvAUC is not installed
library(cvAUC)
library(ranger)  # provides ranger(), which is called below
# Columns for training set
cat.column <- c("cat", "dog", "monkey", "shark", "seal")
num.column <- c(1,2,5,7,9)
class <- c(0,1,0,0,1)
train.set <- data.frame(num.column, cat.column, class)
# Columns for test set
cat.column <- c("cat", "elephant-shrew", "monkey", "monkey", "seal")
num.column <- c(1,11,5,6,8)
class <- c(1,0,1,0,1)
test.set <- data.frame(num.column, cat.column, class)
# Drop the target variable from the test set
target.test <- test.set["class"]
test.set <- test.set[,!names(test.set) %in% "class"]
# Fit random forest
rf = ranger(formula = as.factor(class) ~ . , data = train.set, verbose = FALSE)
# Get predictions
pred <- predict(rf, test.set)
predictions <- pred$predictions
# Get AUC score
auc <- AUC(as.factor(predictions), as.factor(unlist(target.test)), label.ordering = NULL)
cat(auc)
You get the error because AUC expects a numeric vector of predictions, not a factor. In addition, in this example the test set contains a level in the column cat.column (elephant-shrew) that does not appear in the training set. It is good practice to declare all the values a variable can take in both the training and the test set.
library(caret)
library(cvAUC)
library(ranger)
# Columns for training set
cat.column <- c("cat", "dog", "monkey", "shark", "seal")
num.column <- c(1,2,5,7,9)
class <- factor(c(0,1,0,0,1), levels = c(0,1))
train.set <- data.frame(num.column, cat.column, class, stringsAsFactors = FALSE)
# Columns for test set
cat.column <- c("cat", "elephant-shrew", "monkey", "monkey", "seal")
num.column <- c(1,11,5,6,8)
class <- factor(c(1,0,1,0,1), levels = c(0,1))
test.set <- data.frame(num.column, cat.column, class, stringsAsFactors = FALSE)
# Drop the target variable from the test set
target.test <- test.set["class"]
test.set <- test.set[,!names(test.set) %in% "class"]
# Fit random forest
rf = ranger(formula = class ~ . , data = train.set, verbose = FALSE)
# Get predictions
pred <- predict(rf, test.set)
predictions <- pred$predictions
# Get AUC score
auc <- AUC(as.numeric(predictions), target.test$class, label.ordering = NULL)
cat(auc)
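The as.numeric(predictions) call above depends on how R converts a factor to numbers. Here is a minimal base R sketch of that behaviour, using a made-up factor of predicted classes (the values are hypothetical and do not come from the model above):

```r
# Hypothetical predicted classes stored as a factor, the type ranger
# returns for a classification forest
predictions <- factor(c("1", "0", "1", "1", "0"), levels = c("0", "1"))

# as.numeric() on a factor returns the internal level codes (1 and 2),
# not the original labels
codes <- as.numeric(predictions)

# Going through character first recovers the original 0/1 labels
labels01 <- as.numeric(as.character(predictions))

print(codes)
print(labels01)
```

Since AUC depends only on how the predictions are ranked, either numeric version should give the same score. Going through as.character() only matters if you need the actual 0/1 values.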
As you can see, I slightly changed the data preparation steps. First, if your class column is the outcome of a classification task, it is better to coerce it to a factor as early as possible. Second, if the test set contains values of a character variable that do not appear in the training set (as in your example, where the column cat.column contains elephant-shrew, which is not in the training set), it is better to handle that variable as character (in this case you can use stringsAsFactors = FALSE to keep character variables as character).
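If you would rather keep the column as a factor, one possible way to declare all the values in both sets is to build the factors from a shared set of levels. This is only a base R sketch: the vectors train.cat and test.cat are hypothetical stand-ins for the cat.column values in the example.

```r
# Hypothetical stand-ins for cat.column in the training and test sets
train.cat <- c("cat", "dog", "monkey", "shark", "seal")
test.cat  <- c("cat", "elephant-shrew", "monkey", "monkey", "seal")

# Collect every value that appears in either set
all.levels <- union(train.cat, test.cat)

# Build both factors from the same set of levels
train.factor <- factor(train.cat, levels = all.levels)
test.factor  <- factor(test.cat,  levels = all.levels)

# For comparison: with the training levels only, the unseen value becomes NA
test.naive <- factor(test.cat, levels = unique(train.cat))

print(levels(test.factor))
print(sum(is.na(test.naive)))
```

With shared levels, both factors have the same coding and no test value is silently turned into NA. How the model treats a level it never saw during training is a separate question that this sketch does not address.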