I am currently doing regression modeling with a dataset that has many more features (p) than observations (n): typically p = 10,000 and n = 30. I'd also like to test many models and find the best one.
What I'm doing now is to first filter features, reducing them from 10,000 down to 20-30 with step_select_mrmr() or step_select_vip() placed at the top of my recipe (see the sketch below). Then I proceed to test many models.
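A minimal sketch of that setup, assuming a data frame `dat` with an outcome column `y`, and assuming the step takes an `outcome` name and a `top_p` count of features to keep (those argument names are my reading of the package, not something to copy blindly):

```r
library(recipes)
library(colino)  # Steven Pawley's package with step_select_mrmr() and step_select_vip()

# Supervised filter at the top of the recipe: keep the 20 predictors
# with the best mRMR scores against the outcome.
rec <- recipe(y ~ ., data = dat) |>
  step_select_mrmr(all_predictors(), outcome = "y", top_p = 20)
```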
Is this approach reasonable?
It is reasonable as long as you are using resampling or a validation set to make sure that there is no information leakage.
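For instance, one way to make sure the filter is re-estimated inside every resample is to keep it in the recipe, bundle the recipe and model into a workflow, and evaluate with fit_resamples(). This is a sketch under the assumptions above; the model spec and fold settings are placeholders, not a recommendation:

```r
library(tidymodels)

# Placeholder model; swap in whatever model you are comparing.
mod <- rand_forest(mode = "regression") |>
  set_engine("ranger")

wf <- workflow() |>
  add_recipe(rec) |>  # the recipe containing the supervised filter
  add_model(mod)

# With n = 30, repeated 5-fold CV keeps the analysis sets from getting tiny.
folds <- vfold_cv(dat, v = 5, repeats = 5)

# The recipe is prepped on each analysis set only, so the assessment
# rows never influence which features are selected: no leakage.
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)
```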
We hope to have more recipe functions for supervised filters later this year, but Steven's are great.