Background
I have a data set on which I have fitted a lasso model using the tidymodels package. I have now artificially created a new data set on which I would like to use this model to make predictions. This is a classification exercise, where 10 predictors are used to determine 1 binary outcome variable.
Issue
When I use augment(extract_fit_parsnip(lasso_final_fit), new_data = new_investors) %>% glimpse(), R throws the following error:
Error in new_data[, rownames(object$fit$beta), drop = FALSE] : subscript out of bounds
My model was coded as follows:
log_rec <- recipe(defensive ~ ., data = modelling_train) %>%
  step_select(-date) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

lasso_final_spec <- logistic_reg(engine = "glmnet",
                                 penalty = 0.03,
                                 mixture = 1)

lasso_final_wf <- workflow(preprocessor = log_rec,
                           spec = lasso_final_spec)

lasso_final_fit <- last_fit(lasso_final_wf,
                            split = modelling_split,
                            metrics = metric_set(accuracy, sens, spec, roc_auc))
My new_investors data set was created as follows:
new_investors <- data.frame(
  sex = as.factor(sample(c("female", "male"), 100, T)),
  date = as_date(sample(c("2020-03-15", "2021-03-15"), 100, T)),
  age = sample(seq(min(modelling %>% pull(age)),
                   max(modelling %>% pull(age))), 100, T),
  tx = sample(seq(min(modelling %>% pull(tx)),
                  max(modelling %>% pull(tx))), 100, T),
  risk_before = sample(c(0, 0.25, 0.5, 0.75, 1), 100, T),
  risk_after = sample(c(0, 0.25, 0.5, 0.75, 1), 100, T),
  fund_value = rnorm(100, mean(cleaned %>% pull(fund_value), na.rm = T),
                     sd(cleaned %>% pull(fund_value), na.rm = T))
)
I have compared new_investors to modelling_train and am very sure that the predictors I used in the model are all contained within new_investors; the column names are the same as well. I also read that the predict function only accepts data.matrix data sets, so I tried that as well, but it did not work either. When I do not use the new_investors data set in the augment code for this model, the predictions come out perfectly fine.
I must also apologise in advance for not producing an MWE. I know this is what is usually needed, but I am not sure how to produce one in this case.
What could possibly be the issue here? Any intuitive explanations will be greatly appreciated :)
Answer
Hard to tell without the data, but I think that you need to make predictions from the workflow. Using just the parsnip fit skips the preprocessing in the recipe, so your model fit gets different data (sometimes silently) than it would have gotten had the data been processed with the recipe. In your case, the glmnet fit looks up columns by its coefficient row names, which are the dummy-coded, normalized columns the recipe created (e.g. sex_male); the raw new_investors has none of those columns, hence the subscript error.
Here's an example:
library(tidymodels)

data("two_class_dat")

set.seed(1)
split <- initial_split(two_class_dat)

lasso_final_spec <- logistic_reg(engine = "glmnet",
                                 penalty = 0.03,
                                 mixture = 1)

lasso_final_wf <- workflow(preprocessor = Class ~ .,
                           spec = lasso_final_spec)

lasso_final_fit <- last_fit(lasso_final_wf,
                            split = split,
                            metrics = metric_set(accuracy, sens, spec, roc_auc))

# Get the fitted workflow out
lasso_final_wflow <-
  lasso_final_fit %>%
  extract_workflow()

set.seed(2)
new_dat <- tibble(A = runif(3), B = runif(3))

lasso_final_wflow %>% augment(new_dat)
#> # A tibble: 3 × 5
#> A B .pred_class .pred_Class1 .pred_Class2
#> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 0.185 0.168 Class1 0.945 0.0554
#> 2 0.702 0.944 Class1 0.775 0.225
#> 3 0.573 0.943 Class1 0.771 0.229
Created on 2022-10-11 by the reprex package (v2.0.1)
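Applied to the objects in your question, the same pattern would look something like this (untested, since I don't have your data):

lasso_final_wflow <- extract_workflow(lasso_final_fit)

# augment() on the fitted workflow runs new_investors through log_rec
# (dummy coding, normalization, etc.) before glmnet ever sees it
lasso_final_wflow %>%
  augment(new_investors) %>%
  glimpse()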
Also, glance() pulls things out of the fitted model, so you can't use it on a set of new predictions. I suggest saving your metric set as an object and using it to evaluate these predictions (assuming you know their labels).
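For example, a minimal sketch, assuming new_investors also carries the true defensive labels as a factor column:

cls_metrics <- metric_set(accuracy, sens, spec)

lasso_final_wflow %>%
  augment(new_investors) %>%
  # truth = the known label column, estimate = the hard class predictions
  cls_metrics(truth = defensive, estimate = .pred_class)

If you want roc_auc as well, add it to the metric set and pass the relevant class probability column (one of the .pred_* columns that augment() creates) as an extra unnamed argument after estimate.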