I am teaching myself how to use the excellent tidymodels collection of packages to practice machine learning.
In the example below, I am trying to reproduce Julia Silge's blog post (https://juliasilge.com/blog/water-sources/) on using the ranger package to predict water sources.
Instead of the dataset from her post, I am using the built-in diamonds dataset as practice.
I can recreate all of the steps except for yardstick::roc_curve(), which fails when I try to plot the truth against the prediction.
The error I get is below:
Error in `dplyr::summarise()`:
! Problem while computing `.estimate = metric_fn(...)`.
ℹ The error occurred in group 1: id = "Fold01".
Caused by error in `validate_class()`:
! `estimate` should be a numeric but a factor was supplied.
While the data set and transformation steps are different, the below steps roughly correspond to what is in the above link.
I recognize statistically there may be more valid or better ways of doing this, but I am just trying to get more familiarity with the tools and packages and experience in using them.
library(tidyverse)
library(tidymodels)
# set an outcome variable that I want to try to predict (e.g. price is above $10,000)
diamonds <- diamonds %>%
  mutate(high_price_indicator = if_else(price > 10000, "high", "low"))
# split the data into training and testing sets
data_split <- rsample::initial_split(diamonds, strata = high_price_indicator)
training_split <- rsample::training(data_split)
testing_split <- rsample::testing(data_split)
# create cross-validation folds from the training set
diamonds_fold <- rsample::vfold_cv(training_split, strata = high_price_indicator)
# choose model, set engine and mode
rf_spec <- parsnip::rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# set recipe and do some transformations - not sure if the error is here
rec <- recipes::recipe(high_price_indicator ~ ., data = training_split) %>%
  recipes::step_normalize(all_numeric_predictors()) %>%
  recipes::step_zv(all_predictors()) %>%
  recipes::step_dummy(c("cut", "color", "clarity"), one_hot = TRUE)
# create the workflow
workflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rec)
# fit workflow to cross folded data and save predictions
fit_folds <- tune::fit_resamples(
  workflow,
  resamples = diamonds_fold,
  control = control_resamples(save_pred = TRUE)
)
# this is where I get the error
collect_predictions(fit_folds) %>%
  group_by(id) %>%
  roc_curve(high_price_indicator, .pred_class) %>%
  autoplot()
I would appreciate any help understanding where I am going wrong in plotting the predictions against the outcome variable.
Okay, figured it out. I was trying to plot two categorical variables against each other, but roc_curve() requires one truth column (a factor) and one numeric column holding the predicted probabilities for it.
By unnesting the .predictions column in the resampled table fit_folds, you can see that there are three columns with results: .pred_high, .pred_low, and .pred_class. The high and low suffixes correspond to the values of the high_price_indicator column.
.pred_class holds the predicted class label as a factor, while .pred_low and .pred_high hold the predicted probability of each class. In Julia Silge's example, these columns are named .pred_n and .pred_y.
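To see these column types for yourself, something like the sketch below may help. It is not the code from the question: to keep it quick to run it assumes a hypothetical 2,000-row sample of diamonds, three folds, 100 trees, and a plain formula (carat, depth, table) instead of the recipe. All object names starting with small_ are made up for illustration.

```r
library(tidymodels)

set.seed(123)

# Hypothetical small stand-in for the question's data (assumption, for speed)
small_diamonds <- ggplot2::diamonds %>%
  slice_sample(n = 2000) %>%
  mutate(high_price_indicator = factor(if_else(price > 10000, "high", "low")))

small_folds <- vfold_cv(small_diamonds, v = 3, strata = high_price_indicator)

small_spec <- rand_forest(trees = 100) %>%
  set_mode("classification") %>%
  set_engine("ranger")

# Fit on the folds and keep the out-of-fold predictions
small_fit <- fit_resamples(
  small_spec,
  high_price_indicator ~ carat + depth + table,
  resamples = small_folds,
  control = control_resamples(save_pred = TRUE)
)

preds <- collect_predictions(small_fit)

# .pred_class should show up as a factor, .pred_high / .pred_low as doubles
preds %>%
  select(starts_with(".pred_"), high_price_indicator) %>%
  glimpse()
```

The same glimpse() call on collect_predictions(fit_folds) from the question should show the equivalent columns for the full model.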
So when you pass a numeric probability column together with the truth column, you get the ROC curve.
The working code is below:
collect_predictions(fit_folds) %>%
  group_by(id) %>%
  roc_curve(high_price_indicator, .pred_high) %>%
  autoplot()
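The same behaviour can be reproduced without fitting anything, using the two_class_example data that ships with yardstick. This is only an illustrative sketch: in that dataset truth and predicted are factors, and Class1 and Class2 are numeric probabilities, so they play the roles of high_price_indicator, .pred_class, .pred_high and .pred_low here. Note that yardstick treats the first factor level as the event by default, and "high" sorts before "low", which is presumably why .pred_high is the right column to pass above.

```r
library(yardstick)

# Built-in example predictions from yardstick
data("two_class_example", package = "yardstick")

# Passing the factor of predicted classes as the estimate raises an error
bad <- try(roc_curve(two_class_example, truth, predicted), silent = TRUE)
inherits(bad, "try-error")

# Passing the probability of the first factor level (Class1) works
curve <- roc_curve(two_class_example, truth, Class1)
head(curve)

# The same truth / probability pair also works for the area under the curve
auc <- roc_auc(two_class_example, truth, Class1)
auc
```

If the event of interest were the second factor level instead, roc_curve() and roc_auc() accept event_level = "second", in which case the probability column for that second level would be the one to pass.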