I am writing a tidymodels engine to fit the joint LASSO described here and implemented in the fuser package. This is a LASSO model for regression that allows partial sharing of information between groups, in this case tissues in a biomedical RNAseq experiment.
An important difference between this and the regular LASSO is that while it also takes as arguments a matrix of predictors X
and a vector of outcomes Y
, it also takes a vector of group indicators, which in my case would be the tissue that each observation belongs to.
I would like to pass in a column name or role
to the groups
argument of the fit function, so that in a resampling/cross-validation loop, the groups argument corresponds to the grouping of the subset of data used in the resampling loop.
I can't find any information in the parsnip docs describing how to specify extra variables/selections/etc in the fit function. If I try to add it to the engine definition as I've done below, then when I try to use the engine in a normal tidymodels pipeline, I get an error that groups
is missing (expected, given I haven't specified how to pass it to the fitting function).
library("tidymodels")
## the content of src/tidymodels-utils.R is at the bottom of this question
## beginning with set_new_model
source(here("src/tidymodels-utils.R"))
nfeats <- 100
nsamples <- 120
ngroups <- 2
group <- sample(letters[1:ngroups], nsamples, replace=TRUE)
predictors <- matrix(rnorm(nfeats*nsamples), nrow = nsamples, ncol = nfeats,
dimnames = list(paste("Sample", 1:nsamples), paste("Feature", 1:nfeats))
)
outcome <- rnorm(nsamples)
## tidymodels wants a dataframe input
input <- data.frame(
predictors,
group = group,
outcome = nsamples
)
norm_recipe <- recipe(input) %>%
update_role(matches("Feature"), new_role = "predictor") %>%
update_role(outcome, new_role = "outcome") %>%
update_role(group, new_role = "group") %>%
## some steps to center/scale/select predictors and outcome omitted here
prep()
glmnet_model <- joint_lasso(penalty = tune(), fusion = tune(),
groups = matches("group")) %>%
set_engine("fusedLassoProximal")
workflow <- workflow() %>%
add_recipe(norm_recipe) %>%
add_model(glmnet_model)
folds <- vfold_cv(input, v = nfolds, repeats = nrepeats)
glmn_set <- parameters(
penalty(),
fusion()
)
glmn_grid <- grid_regular(glmn_set, levels = 3)
ctrl <- control_grid(save_pred = TRUE, verbose = TRUE)
results <- workflow %>%
tune_grid(
resamples = folds,
metrics = metric_set(rmse),
grid = glmn_grid,
control = ctrl
)
# Error in `mutate()`:
# ℹ In argument: `object = purrr::map(call_info, eval_call_info)`.
# Caused by error in `purrr::map()`:
# ℹ In index: 1.
# Caused by error in `.f()`:
# ! Error when calling fusedLassoProximal(): Error in unique(groups) : argument "groups" is missing, with no default
# Run `rlang::last_trace()` to see where the error occurred.
Any ideas where this will fit? There's a lot of boilerplate but my journeys through the various ?
pages hasn't shown anything obvious.
Worth noting that the prediction method will also need this group information about the test data in each resample, because the return value of fuser::fusedProximalLasso
is simply a matrix of coefficients, one column for each unique value of group
.
The whole set of function calls to define the engine (and the fusion()
penalty param) is as follows:
set_new_model("joint_lasso")
set_model_mode(model = "joint_lasso", mode = "regression")
set_model_engine(
"joint_lasso",
mode = "regression",
eng = "fusedLassoProximal"
)
set_dependency("joint_lasso",
eng = "fusedLassoProximal", pkg = "fuser"
)
set_model_arg(
model = "joint_lasso",
eng = "fusedLassoProximal",
parsnip = "penalty",
original = "lambda",
func = list(pkg = "fuser", fun = "fusedLassoProximal"),
has_submodel = FALSE
)
set_model_arg(
model = "joint_lasso",
eng = "fusedLassoProximal",
parsnip = "groups",
original = "groups",
func = list(pkg = "fuser", fun = "fusedLassoProximal"),
has_submodel = FALSE
)
joint_lasso <- function(
mode = "regression",
penalty = NULL,
fusion = NULL,
groups = NULL) {
if (mode != "regression") {
rlang::abort("`mode` should be 'regression'")
}
args <- list(
groups = rlang::enquo(groups),
penalty = rlang::enquo(penalty),
fusion = rlang::enquo(fusion)
)
# Save some empty slots for future parts of the specification
new_model_spec(
"joint_lasso",
args = args,
eng_args = NULL,
mode = mode,
method = NULL,
engine = NULL
)
}
set_fit(
model = "joint_lasso",
eng = "fusedLassoProximal",
mode = "regression",
value = list(
interface = "matrix",
protect = c("X", "Y", "groups"),
func = c(pkg = "fuser", fun = "fusedLassoProximal"),
defaults = list()
)
)
set_encoding(
model = "joint_lasso",
eng = "fusedLassoProximal",
mode = "regression",
options = list(
predictor_indicators = "traditional",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
predict_joint_lasso <- function(object, new_data=NULL) {
## some code here to use the new data
}
pred_info <- list(
pre = NULL,
post = NULL,
func = c(fun = "predict_joint_lasso"),
args =
list(
object = quote(object$fit),
new_data = quote(new_data)
)
)
set_pred(
model = "joint_lasso",
eng = "fusedLassoProximal",
mode = "regression",
type = "numeric",
value = pred_info
)
show_model_info("joint_lasso")
fusion <- function(range=c(0, 2), trans = NULL) {
new_quant_param(
type = "double",
range = range,
inclusive = c(TRUE, TRUE),
trans = trans,
label = c(fusion = "Fusion parameter"),
finalize = NULL
)
}
I think that you should write a wrapper function to provide a different interface for that function. Your point about resampling is spot-on and you need a way to point fuser's functions to specific columns in the data.
We had similar issues with the gee function, which also had an argument for a length n
vector of groups.
In that case, we used the little documented "specials" feature of base R formulas.
Here is a gist that has an implementation if you want to use it. Basically, you would require a formula to be used that has a groups()
function in it. The variable in that function defines the groups. It's all there.
One other thing (addressed in the gist): the G
argument is data-dependent. If a user uses a recipe or some tool that might reduce features, then the dimensions might be off. There is a function that will take the G
argument as an expression and then evaluate it with the number of columns that are actually in the data. There are other (and probably better) approaches, but I thought I'd point that out.
Feel free to put in a parsnip R if there is more that we can help with.