I am struggling with the interoperability of the R packages torch and targets. For example, if I define a target of class dataset (from torch), then it is impossible to read it with tar_read (from targets), and I cannot use it in other targets.
Here is my dataset generator nn_dataset:
library(torch)
library(targets)
library(dplyr)
library(tidymodels)

nn_dataset <-
  dataset(
    name = "nn_dataset",
    initialize = function(df) {
      data <- self$prepare_data(df)
      self$tele <- data$x$tele
      self$class <- data$x$class
      self$y <- data$y
    },
    .getitem = function(i) {
      list(
        x = list(
          tele = self$tele[i, ],
          class = self$class[i, ]
        ),
        y = self$y[i, ]
      )
    },
    .length = function() {
      self$y$size()[[1]]
    },
    prepare_data = function(df) {
      target_col <-
        df$claim_ind_cov_1_2_3_4_5_6 %>%
        as.integer() %>%
        `-`(1) %>%
        as.matrix()
      tele_cols <-
        df %>%
        select(starts_with(c("h_", "p_", "vmo", "vma"))) %>%
        as.matrix()
      class_df <- select(df, expo:years_licensed, distance)
      rec_class <-
        recipe(~ ., data = class_df) %>%
        step_impute_median(commute_distance, years_claim_free) %>%
        step_other(all_nominal(), threshold = 0.05) %>%
        step_dummy(all_nominal()) %>%
        prep()
      class_cols <- juice(rec_class) %>% as.matrix()
      list(
        x = list(
          tele = torch_tensor(tele_cols),
          class = torch_tensor(class_cols)
        ),
        y = torch_tensor(target_col)
      )
    }
  )
If I define the following target:
tar_target(
  name = target_name,
  command = nn_dataset(valid_df)
)
where valid_df is a tibble, and if I then try to read it:
tar_read(target_name)
then I get this error:
Error in cpp_tensor_dim(self$ptr) : external pointer is not valid
I have also tried this:
tar_target(
  name = target_name,
  command = nn_dataset(valid_df),
  format = "torch"
)
and this:
tar_torch(
  name = target_name,
  command = nn_dataset(valid_df)
)
but neither worked.
The format = "torch" capability of targets relies on torch::torch_save() and torch::torch_load(), and these functions in torch do not work on the custom R6 classes that come out of MyDataset(mtcars) in your example. On top of that, torch data is "non-exportable", and as discussed at https://books.ropensci.org/targets/targets.html#saving and https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html, that data cannot simply be saved to disk with something like saveRDS() (which is the default in targets). A torch tensor is an R handle to memory managed outside R through an external pointer, and that pointer does not survive serialization, which is why reading the target back fails with "external pointer is not valid". I do not know torch well enough to recommend something specific, but a solution would require figuring out the R code that will safely save and load one of these objects, then creating your own custom storage format using tar_format(). The code at https://docs.ropensci.org/targets/reference/tar_format.html#ref-examples has an example for Keras models.
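To give a rough idea of what such a custom format could look like, here is a minimal sketch, not a tested solution. It assumes the dataset exposes its tensors as the fields tele, class and y (as nn_dataset in the question does), and it stores only those tensors as plain R arrays. The name dataset_tensors_format is made up for illustration. Note the limitation: the read function returns a named list of tensors rather than the original dataset object, so downstream code would have to accept that list, and tensor dtypes may not round-trip exactly.

```r
library(targets)

# Hypothetical custom format (the name is an assumption for this sketch).
# The functions passed to tar_format() need to be self-contained, so
# package functions are called with `::`.
dataset_tensors_format <- tar_format(
  write = function(object, path) {
    # Convert each tensor to an ordinary R array, which saveRDS() can store.
    arrays <- list(
      tele = torch::as_array(object$tele),
      class = torch::as_array(object$class),
      y = torch::as_array(object$y)
    )
    saveRDS(arrays, path)
  },
  read = function(path) {
    # Rebuild fresh tensors from the stored arrays.
    # This returns a named list of tensors, NOT the original dataset object.
    arrays <- readRDS(path)
    lapply(arrays, torch::torch_tensor)
  }
)

# Possible usage in _targets.R (assumes nn_dataset and valid_df exist):
# tar_target(
#   name = target_name,
#   command = nn_dataset(valid_df),
#   format = dataset_tensors_format
# )
```

If the pipeline runs on parallel workers, the marshal and unmarshal arguments of tar_format() would probably be needed as well, as in the Keras example linked above.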
A better alternative would be to avoid saving R6 objects altogether, because those are really pieces of code that do not hash well. If you can restructure the pipeline to save simpler versions of the data and only re-create those R6 classes on an as-needed basis, that would be much better, especially if those R6 classes take almost no time to create from e.g. a data frame. So your first target could be the mtcars data frame, and then the model-fitting target could call MyDataset(mtcars), fit the model, and return easy-to-save output generated from that fitted model.
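The restructuring described above might look roughly like the following _targets.R sketch. The MyDataset generator, the summarize_dataset() helper and the chosen mtcars columns are all assumptions made up for illustration, and the model fitting is left out. The point is only that the torch dataset is created and used inside one function, and that the target returns a plain data frame.

```r
# _targets.R (sketch)
library(targets)

# Hypothetical dataset generator built from a plain data frame.
MyDataset <- torch::dataset(
  name = "MyDataset",
  initialize = function(df) {
    self$x <- torch::torch_tensor(as.matrix(df[, c("wt", "hp")]))
    self$y <- torch::torch_tensor(df$mpg)
  },
  .getitem = function(i) {
    list(x = self$x[i, ], y = self$y[i])
  },
  .length = function() {
    self$y$size()[[1]]
  }
)

# Hypothetical helper: the dataset exists only inside this function.
# A real version would fit a model here and return its metrics or
# predictions; this sketch just returns simple summary numbers.
summarize_dataset <- function(df) {
  ds <- MyDataset(df)
  data.frame(
    n_obs = ds$.length(),
    n_features = ds$x$size()[[2]]
  )
}

list(
  # Easy to save and hash: an ordinary data frame.
  tar_target(raw_df, mtcars),
  # The R6 dataset is rebuilt on demand; only a data frame is stored.
  tar_target(dataset_summary, summarize_dataset(raw_df))
)
```

With this layout, tar_read(dataset_summary) only ever has to read a data frame, so no torch object needs to be serialized.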