r, tidymodels

Tidymodels grid search for semi-supervised algorithms


I need to use tidymodels to perform a grid search over hyperparameters for a few semi-supervised algorithms as implemented in the SSLR package (https://dicits.ugr.es/software/SSLR/index.html).

Let us take the LinearTSVMSSLR() function as an example. It has two hyperparameters of interest to me, C and Cstar, and I want to do a grid search over four values of each.
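
For concreteness, the kind of grid I have in mind is something like this (the candidate values below are just placeholders):

library(tidyr)

# Four candidate values per hyperparameter, 16 combinations in total
grid <- crossing(
  C     = c(0.25, 0.5, 1, 2),
  Cstar = c(0.01, 0.1, 1, 10)
)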

This raises three problems with the standard way of doing a grid search in tidymodels: the resampling, the fitting, and the prediction.

Let us assume that the labeled data is the data frame dat and that the column target is the classification target. The grid search must be performed on this labeled subset of my data.

The algorithms in SSLR mark an observation as unlabeled by setting its target value to NA. Also, to evaluate the different hyperparameter values I will be using transductive testing, that is, evaluating how well the algorithm recovers the labels of the given unlabeled data (as opposed to evaluating its correctness on new data points).

So what I need in terms of fit and predict is something like:

# Randomly mark 50% of the rows as unlabeled, stratified on the target
unlabeled <- caret::createDataPartition(dat$target, p = .5, list = FALSE)

oldtarget <- dat$target
dat$target[unlabeled] <- NA      # NA target = unlabeled for SSLR

# model is e.g. LinearTSVMSSLR(C = 1, Cstar = 0.1)
fitted_mod <- model |> fit(target ~ ., data = dat)

# For testing: predict the rows whose labels were removed
results <- predict(fitted_mod, dat[unlabeled, ])

And then compare results with oldtarget[unlabeled].
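
Concretely, the transductive check I have in mind is something like the following (assuming, as with parsnip models, that predict() returns a tibble with a .pred_class column; a yardstick metric could be used instead of the raw mean):

# Proportion of the removed labels that were recovered correctly
mean(as.character(results$.pred_class) == as.character(oldtarget[unlabeled]))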

I think the simplest solution would be to resample and divide the data frame into training and test sets (in my case, since I want 50% unlabeled, repeated 2-fold cross-validation). However, the fit phase would need access to both the training and test data frames: it would rbind them, setting the target column of the test rows to NA in the process. The test/predict phase would need the same access. But I do not know how to do this within tidymodels.
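
To make the idea concrete, here is a rough, untested sketch of what I mean, written by hand with rsample and purrr rather than tune_grid(); transductive_acc is just a hypothetical helper, and I am assuming SSLR's fit()/predict() behave as in the snippet above:

library(SSLR)
library(rsample)
library(purrr)

folds <- vfold_cv(dat, v = 2, repeats = 5, strata = target)

# Fit on analysis + assessment rows (assessment targets blanked out),
# then score how well the assessment labels are recovered.
transductive_acc <- function(split, C, Cstar) {
  train <- analysis(split)
  test  <- assessment(split)

  test_na        <- test
  test_na$target <- NA                 # mark these rows as unlabeled
  full           <- rbind(train, test_na)

  mod <- LinearTSVMSSLR(C = C, Cstar = Cstar) |>
    fit(target ~ ., data = full)

  preds <- predict(mod, test)
  mean(as.character(preds$.pred_class) == as.character(test$target))
}

# One hyperparameter combination, evaluated over all resamples
map_dbl(folds$splits, transductive_acc, C = 1, Cstar = 0.1)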

There are probably other solutions.

Does anyone know how to solve this problem?


Solution

  • SSLR was released about a month after the initial release of the tune package. I don't think that it was designed with the non-parsnip parts of tidymodels in mind.

    The only solution I can think of is to use tune_grid() (or similar functions) to fit the models, together with the extract argument of the control functions (e.g. control_grid()) to return the fitted workflows; a rough sketch is at the end of this answer.

    From there, you can evaluate the model in any way you want with the data of your choice.

    I'm sure a better approach is possible, so I have opened an issue in the tune repo in case anyone has thoughts or time to contribute.
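
    Very roughly, the extract pattern looks like the sketch below. It is untested and assumes that the SSLR specification can be wrapped in a workflow (wflow here), that its C and Cstar arguments can be flagged with tune(), and that folds and grid are the resamples and the C/Cstar grid from the question; the extracted workflows then have to be evaluated by hand.

    library(tidymodels)

    # Keep every fitted workflow instead of only the resampling metrics
    ctrl <- control_grid(extract = function(x) x)

    res <- tune_grid(
      wflow,                  # workflow wrapping the SSLR model spec
      resamples = folds,
      grid      = grid,
      control   = ctrl
    )

    # The fitted workflows end up in the .extracts list column,
    # one row per resample, ready for any custom (transductive) evaluation
    res$.extracts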