[SOLVED] R binary metrics by percentile

Is there a tidyverse/tidymodels (or base R) way to compute binary classification metrics with the probability threshold chosen so that a specific percentage of cases is predicted positive?

The tidymodels guide suggests building a data frame of predicted probabilities that holds the positive-class probability (.pred_1) alongside the actual class (Day90):

> rf_fit %>% predict(test, type="prob") %>% bind_cols(test %>% select(Day90))
# A tibble: 31,586 × 3
   .pred_1 .pred_0 Day90
     <dbl>   <dbl> <fct>
 1  0.296    0.704 0    
 2  0.296    0.704 0    
 3  0.136    0.864 0    
 4  0.0690   0.931 0    
 5  0.0882   0.912 0    
 6  0.0948   0.905 0    
 7  0.157    0.843 0    
 8  0.0572   0.943 0    
 9  0.108    0.892 0    
10  0.0466   0.953 0    
# ℹ 31,576 more rows
# ℹ Use `print(n = ...)` to see more rows

type="quantile" is promising but not available for parsnip's rand_forest().

Ideally there is a function that takes a positive percentile, say 20%, and finds a probability threshold k that results in about 20% predicted positive. I could sort the probabilities and perform a linear or binary search on k, but I'm sure this is already implemented in a more robust way. dplyr::percent_rank() also seems promising.

Solution

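Use the quantile() function on the predicted probabilities. The 80th percentile of .pred_1 is the threshold k above which about 20% of the rows fall. Note that rf_fit is the fitted model and has no .pred_1 column, so save the predictions to a data frame first:

preds <- rf_fit %>% predict(test, type="prob") %>% bind_cols(test %>% select(Day90))
k <- quantile(preds$.pred_1, 0.8)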
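
To get from the threshold to actual metrics, something along these lines might work. This is only a rough base R sketch: the data is simulated (a made-up stand-in for the real predictions), and the names preds, target_rate, k and .pred_class are my own, not from tidymodels. It assumes "1" is the positive class and that it is the first factor level.

```r
# Simulated stand-in for the predictions data frame (hypothetical data).
set.seed(42)
n <- 1000
preds <- data.frame(.pred_1 = runif(n))
preds$Day90 <- factor(rbinom(n, 1, preds$.pred_1), levels = c(1, 0))

# Threshold k: the (1 - target_rate) quantile of the positive-class
# probability, so roughly the top 20% of rows score above it.
target_rate <- 0.20
k <- quantile(preds$.pred_1, probs = 1 - target_rate, names = FALSE)

# Hard class predictions at that threshold, same level order as Day90.
preds$.pred_class <- factor(ifelse(preds$.pred_1 > k, 1, 0), levels = c(1, 0))

# Confusion counts and a few metrics computed by hand.
tp <- sum(preds$.pred_class == "1" & preds$Day90 == "1")
fp <- sum(preds$.pred_class == "1" & preds$Day90 == "0")
fn <- sum(preds$.pred_class == "0" & preds$Day90 == "1")
tn <- sum(preds$.pred_class == "0" & preds$Day90 == "0")

positive_rate <- mean(preds$.pred_class == "1")
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
accuracy <- (tp + tn) / n
```

With real random forest output, tied probabilities are common (0.296 appears twice in the printout above), so the share predicted positive may come out a bit off from the target. Once .pred_class exists, the yardstick metric functions such as accuracy(), precision() and recall() should accept it as the estimate column with Day90 as truth.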