I have a data set with 90 variables and 200,000 observations. It is imbalanced: the target variable is 1 in only 4% of cases and 0 in all others.
I split it into two sets: a fitting sample (185,000 obs.) and a holdout sample, "df_holdout" (15,000 obs.). For model fitting, I took from the fitting sample all cases where the target variable = 1 and the same number of cases where the target variable = 0 (in total, the set "df" included 25,000 obs.).
The variables are named var_01, var_02, var_03, ... var_90, where var_90 was renamed to "target".
I have a stack of workflows.
This is the code that I use for model fitting:
rf_tune <- parsnip::rand_forest(mode = "classification",
                                mtry = tune(),
                                trees = 1000,
                                min_n = tune()) %>%
  set_engine("ranger",
             importance = "impurity")
svm_tune <- parsnip::svm_poly(mode = "classification",
                              engine = "kernlab",
                              cost = tune(),
                              degree = tune(),
                              scale_factor = tune(),
                              margin = tune())
# Create data split object
df_split <- initial_split(df, prop = 0.75,
                          strata = target)
# Create the training data
df_train <- df_split %>%
  training()
# Create the test data
df_test <- df_split %>%
  testing()
# create a recipe
df_recipe <- recipe(target ~ ., data = df_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric()) %>%
  step_corr(threshold = 0.7) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes())
df_recipe %>%
  prep(df_train) %>%
  bake(df_train)
all_models_set <-
  workflow_set(preproc = list(df_recipe = df_recipe),
               models = list(rf_tune,
                             svm_tune),
               cross = TRUE)
set.seed(123)
# Resample the training data created above (df_train)
cv <- vfold_cv(df_train, v = 5, repeats = 1, strata = target)
df_metr <- metric_set(accuracy, roc_auc, sens, spec)
all_models <-
  all_models_set %>%
  workflow_map("tune_grid",
               resamples = cv,
               grid = 10,
               control = control_resamples(save_pred = TRUE, save_workflow = TRUE, verbose = TRUE),
               metrics = df_metr
  )
# Get the workflow ID for the top model from our workflow set
best_tuned_workflow <-
  rank_results(all_models, rank_metric = "roc_auc", select_best = TRUE) %>%
  filter(.metric == "roc_auc" & rank == 1)
# Get the best hyperparameters for that workflow
final_model <-
  extract_workflow_set_result(all_models, pull(best_tuned_workflow, wflow_id)) %>%
  select_best(metric = "roc_auc")
# Fit final model on Train and predict on Test set
final_model_pred <-
  extract_workflow(all_models, pull(best_tuned_workflow, wflow_id)) %>% # extract the workflow
  finalize_workflow(final_model) %>%
  last_fit(df_split) # fit the model on Train and score on Test
# final workflow extraction
wf_final_model <- extract_workflow(final_model_pred)
After I created a model and trained the workflow (wf_final_model), I saved it and wanted to use it for prediction on the holdout sample. However, when I tried to do so, I got an error message:
predict(wf_final_model, df_holdout)
Error: Missing data in columns: var_02_X4, var_02_X7, var_02_X9, var_02_X10, var_02_X11, var_02_X12, var_02_X13, var_02_X15, var_02_X17, var_02_X18, var_02_X20, var_02_X21, var_02_X22, var_02_X23, var_02_X24, var_02_X25, var_02_X26, var_02_X27, var_02_X28, var_02_X29, var_02_X30, var_02_X31, var_02_X33, var_02_X34, var_30_X2, var_30_X3, var_30_X6, var_30_X7, var_30_X9, var_30_X11, var_30_X13, var_30_X14, var_30_X15, var_30_X16, var_30_X17, var_30_X18, var_30_X19, var_30_X20, var_30_X22, var_30_X23, var_30_X24, var_30_X25, var_30_X26, var_30_X27, var_30_X33, var_30_X43, var_30_X46, var_30_X48, var_30_X49, var_30_X51, var_30_X56, var_30_X57, var_30_X60, var_36_X14, var_36_X18, var_36_X21, var_36_X24, var_36_X28, var_36_X29, var_36_X32, var_36_X44, var_36_X57, var_36_X61, var_36_X63, var_36_X85, var_36_X125, var_36_X130, var_36_X136, var_36_X144, var_36_X147, var_36_X148, var_36_X166, var_36_X169, var_36_X171, var_89_X3, var_89_X4, var_89_X5, var_89_X6, var_89_X7, var_89_X8, var_89_X9, va
In addition: Warning messages:
1: Novel levels found in column 'var_02': '2', '5'. The levels have been removed, and values have been coerced to 'NA'.
2: Novel levels found in column 'var_30': '39', '41', '42', '47', '54'. The levels have been removed, and values have been coerced to 'NA'.
3: Novel levels found in column 'var_36': '118'. The levels have been removed, and values have been coerced to 'NA'.
4: Novel levels found in column 'var_89': '2'. The levels have been removed, and values have been coerced to 'NA'.
5: There are new levels in a factor: NA
6: There are new levels in a factor: NA
7: There are new levels in a factor: NA
8: There are new levels in a factor: NA
I don't have any variables with such names in the training, test, or holdout set. As I understand it, such variables represent interactions, but I am not sure how to handle them. Can you please help me fix the error so that I can get the predictions?
The variable names you are seeing (var_02_X4, var_02_X7, var_02_X9, var_02_X10, and so on) are not interactions. They were created by step_dummy(), i.e. they correspond to the levels of var_02 (X4, X7, X9, X10, and so on).
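To see where such names come from, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not your data). step_dummy() builds each column name from the variable name plus a syntactically valid version of the level, so level "4" of var_02 shows up as var_02_X4.

```r
library(recipes)

# Hypothetical toy data: one factor predictor and a binary outcome
toy <- data.frame(
  var_02 = factor(c("1", "4", "7", "9", "4", "7")),
  target = factor(c(0, 1, 0, 1, 0, 1))
)

# Dummy-encode the nominal predictor only
rec <- recipe(target ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors())

# Bake the training data itself and inspect the generated column names
baked <- bake(prep(rec, training = toy), new_data = NULL)
names(baked)
```

The first level ("1" here) serves as the reference and gets no column of its own, which is why not every level appears in the names.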
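The error occurs because the holdout set contains factor levels that were not present in the training data. As the warnings show, those values are coerced to NA, so the dummy columns built from them contain missing data. One way to deal with this is to add step_unknown(), which assigns missing factor values to a separate "unknown" level, before step_dummy():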
# create a recipe
df_recipe <- recipe(target ~ ., data = df_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric()) %>%
  step_corr(threshold = 0.7) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
You don't need -all_outcomes(), as all_nominal_predictors() doesn't select outcomes.
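As a rough illustration of what step_unknown() changes, here is a small sketch on made-up data (train, holdout, and their values are hypothetical). It assumes the unseen level has already been turned into NA by the time the recipe is applied, as the warnings in the question report.

```r
library(recipes)

# Hypothetical training data with three known levels of var_02
train <- data.frame(
  var_02 = factor(c("4", "7", "9", "4", "7", "9")),
  var_01 = c(1.2, 0.7, 3.1, 2.2, 0.4, 1.9),
  target = factor(c(0, 1, 0, 1, 0, 1))
)

# Hypothetical new data: the second row had an unseen level that is now NA
holdout <- data.frame(
  var_02 = factor(c("4", NA), levels = c("4", "7", "9")),
  var_01 = c(1.0, 2.0)
)

# step_unknown() must come before step_dummy()
rec <- recipe(target ~ ., data = train) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

baked <- bake(prep(rec, training = train), new_data = holdout)
baked
```

In the baked result, the second row should be flagged in a var_02_unknown column instead of producing NA in the dummy columns, so the model receives complete data.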