I am looking into using R's targets package, but I am struggling to get it to accept multiple file outputs.
For example, I want to be able to take a dataset, create a train/test split and write each dataset to a separate file.
An MWE would be the following _targets.R:
library(targets)
source("R/functions.R")
set.seed(124)
list(
  # created using write.csv(mtcars, "data/mtcars.csv")
  tar_target(raw_data, "data/mtcars.csv", format = "file"),
  tar_target(data, read.csv(raw_data)),
  # this throws an error here:
  tar_target(train_test, split_dataset(data), format = "file"),
  # this only shows how I would try to use the train/test datasets
  tar_target(model, train_model(train_test)),
  tar_target(eval, eval_model(model, train_test))
)
where split_dataset() is defined in R/functions.R:
split_dataset <- function(data) {
  idx <- sample.int(nrow(data), 0.8 * nrow(data))
  train <- data[idx, ]
  test <- data[-idx, ]
  write.csv(train, "data/train.csv")
  write.csv(test, "data/test.csv")
  return(c("data/train.csv", "data/test.csv"))
}
One alternative would be to return a list, list(train = train, test = test), but I want to be able to access either dataset if possible and save the datasets as separate files.
Another alternative would be to define the index in the targets list, split the dataset, and write each dataset in a separate target. If possible, I would like to condense the steps into one (as shown above) to make the targets file easier to understand.
I recommend recording idx as a column of data (the logical is_training column in the sketch below) and then filtering on it later for the train and test targets. Also, you do not need format = "file" to be able to access datasets later. You can use tar_read() or tar_load() for that. Sketch:
library(targets)
library(tibble)
dir.create("data")
write.csv(mtcars, "data/mtcars.csv")
tar_script({
  library(tibble)
  split_data <- function(data) {
    idx <- sample.int(n = nrow(data), size = 0.8 * nrow(data))
    data$is_training <- seq_len(nrow(data)) %in% idx
    as_tibble(data)
  }
  list(
    tar_target(raw_data, "data/mtcars.csv", format = "file"),
    tar_target(data, split_data(read.csv(raw_data)), format = "feather"),
    tar_target(train, data[data$is_training, ], format = "feather"),
    tar_target(test, data[!data$is_training, ], format = "feather")
  )
})
tar_visnetwork()
tar_make()
#> ● run target raw_data
#> ● run target data
#> ● run target test
#> ● run target train
#> ● end pipeline
tar_read(train)
#> # A tibble: 25 x 13
#> X mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> # … with 15 more rows, and 1 more variable: is_training <lgl>
tar_read(test)
#> # A tibble: 7 x 13
#> X mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 2 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
#> 3 Lincoln Con… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
#> 4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
#> 6 Fiat X1-9 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 7 Lotus Europa 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> # … with 1 more variable: is_training <lgl>
Created on 2021-03-30 by the reprex package (v1.0.0)
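If separate train and test files on disk are still wanted, one possible sketch is a small helper that writes both CSVs and returns both paths. The helper name write_split() and its out_dir argument are hypothetical and not part of the reprex above; the sketch assumes the data already carries the logical is_training column created by split_data(). Because a format = "file" target is expected to return a character vector of paths, a helper like this could plausibly back a single target such as tar_target(split_files, write_split(data), format = "file"), with downstream targets picking split_files[1] or split_files[2]. That wiring is an assumption and is not exercised in the code below, which uses base R only.

```r
# Hypothetical helper (not part of targets): write the train and test rows
# of `data` to separate CSV files and return both paths.
# Assumes `data` has a logical column `is_training`.
write_split <- function(data, out_dir = "data") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  train_path <- file.path(out_dir, "train.csv")
  test_path <- file.path(out_dir, "test.csv")
  # row.names = FALSE drops the row names, so no extra column appears on re-read
  write.csv(data[data$is_training, ], train_path, row.names = FALSE)
  write.csv(data[!data$is_training, ], test_path, row.names = FALSE)
  # Unnamed character vector: element 1 is train, element 2 is test
  c(train_path, test_path)
}

# Standalone usage outside of a pipeline, writing to a temporary directory
set.seed(124)
example_data <- mtcars
idx <- sample.int(n = nrow(example_data), size = floor(0.8 * nrow(example_data)))
example_data$is_training <- seq_len(nrow(example_data)) %in% idx
split_paths <- write_split(example_data, out_dir = file.path(tempdir(), "split"))
train <- read.csv(split_paths[1])
test <- read.csv(split_paths[2])
```

Returning the paths in a fixed order keeps the helper usable both on its own and as the return value of a file target, at the cost of having to remember which position holds which split.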