I am trying to train a time-independent Cox model on a dataset of ~750,000 rows, as well as a time-dependent one on several million rows. I have 19 variables, some binary and some continuous. I have been modelling the continuous variables with restricted cubic splines (the `rcs` function; `pspline` was giving me numerical issues for some reason) as an easy way to deal with non-linearity, and judging by the concordance index on a test set, overfitting hasn't been a problem. However, I want to do variable selection on my models, preferably via LASSO regularization, as best subset selection will likely be too computationally heavy. The survival package in R has a `ridge` function, but it has to be applied to individual variables and I'd like to penalize the whole model. There's also `cv.glmnet` with `family = "cox"`, but this doesn't allow me to use splines (I believe). Is there a nice way to do this?
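For concreteness, this is roughly what I had in mind with `cv.glmnet` (toy data and made-up variable names, not my actual model): I would have to expand the continuous predictors into spline basis columns myself, and I realise the lasso would then penalize each basis column separately rather than whole spline terms.

```r
library(survival)
library(splines)
library(glmnet)

set.seed(1)
n  <- 1000                                  # toy data standing in for my rows
x1 <- rnorm(n); x2 <- rnorm(n)              # continuous predictors
z1 <- rbinom(n, 1, 0.4)                     # a binary predictor
time   <- rexp(n, rate = exp(0.5 * x1 + 0.3 * z1))
status <- rbinom(n, 1, 0.8)

# Build the design matrix by hand: spline bases for the continuous terms,
# binary variables as-is; drop the intercept column.
X <- model.matrix(~ ns(x1, df = 4) + ns(x2, df = 4) + z1,
                  data = data.frame(x1, x2, z1))[, -1]

# Recent glmnet versions accept a Surv object; older ones need
# a two-column matrix cbind(time = time, status = status).
y  <- Surv(time, status)
cv <- cv.glmnet(X, y, family = "cox", alpha = 1)  # alpha = 1 -> lasso
coef(cv, s = "lambda.min")
```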
The `penalized` package for R does Cox PH regressions with L1 (lasso) and L2 (ridge) penalties. It appears to work with `bs` from the `splines` package (but I have not tested it with `rcs` or other splines). I did fit a model using `bs` splines in the formula and it gave an answer (and an optimum L1 penalty), but I don't know how this compares to the regular penalized spline approach in `survival`.
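Roughly the kind of call I used, sketched here on toy data (the variable names, degrees of freedom and fold count are placeholders, not my actual model): `optL1` cross-validates the L1 penalty, and the `bs` terms go directly into the penalized-side formula. Note that the lasso penalizes each spline basis column separately, so a spline term can be only partially selected.

```r
library(survival)
library(splines)
library(penalized)

set.seed(1)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), z1 = rbinom(n, 1, 0.4))
dat$time   <- rexp(n, rate = exp(0.4 * dat$x1 + 0.3 * dat$z1))
dat$status <- rbinom(n, 1, 0.8)

# Cross-validate the L1 (lasso) penalty for a Cox model with B-spline terms
opt <- optL1(Surv(time, status),
             penalized = ~ bs(x1, df = 4) + bs(x2, df = 4) + z1,
             data = dat, fold = 5)

# Refit at the selected penalty and inspect the surviving coefficients
fit <- penalized(Surv(time, status),
                 penalized = ~ bs(x1, df = 4) + bs(x2, df = 4) + z1,
                 data = dat, lambda1 = opt$lambda)
coefficients(fit)
```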