I have been trying to tackle a regression problem by training a neural network to predict a continuous variable with torch for R. My question is about the syntax used to achieve this.
When initializing the dataset for torch, I have to declare what 'x' and 'y' are. I understand that 'y' is the target variable the model will aim to predict, and that 'x' should contain the predictor variables used to make the predictions. What I am unsure about is whether 'y' should also be included in the 'x' matrix. Here is what my code looks like so far:
library(torch)
library(luz)
library(dplyr)

torch_ds <- dataset(
  name = "torch_dataset",
  initialize = function(df) {
    # get names of categorical variables (all have been converted to factors)
    vars <- df %>%
      select_if(is.factor) %>%
      colnames()
    df[vars] <- lapply(df[vars], as.numeric)
    self$x <- as.matrix(df) %>% # This x matrix contains the target variable, y
      torch_tensor()
    self$y <- torch_tensor(as.matrix(df$target)) # Defining the target variable
  },
  .getitem = function(i) {
    list(x = self$x[i, ], y = self$y[i])
  },
  .length = function() {
    dim(self$x)[1]
  }
)

# Convert the train and test data frames to torch dataset objects:
train_tensor <- torch_ds(train_df)
test_tensor <- torch_ds(test_df)

# Turn those datasets into dataloader objects:
train_dl <- dataloader(train_tensor, batch_size = 100, shuffle = FALSE)
test_dl <- dataloader(test_tensor, batch_size = 100, shuffle = FALSE)

# Dimensions entering/exiting the layers of the neural network:
d_in <- 44      # number of features (columns in the dataset)
d_hidden <- 500 # dimensionality of the hidden layer
d_out <- 1      # output dimensionality (number of predicted variables)

# Structure of the network:
net <- nn_module(
  initialize = function(d_in, d_hidden, d_out) {
    self$net <- nn_sequential(
      nn_linear(d_in, d_hidden),
      nn_relu(),
      nn_linear(d_hidden, d_out)
    )
  },
  forward = function(x) {
    self$net(x)
  }
)

# Fit the network to the training dataloader:
fitted <- net %>%
  setup(
    loss = nn_mse_loss(),
    optimizer = optim_adam,
    metrics = list(luz_metric_mae(), luz_metric_rmse(), luz_metric_mse())
  ) %>%
  set_hparams(d_in = d_in, d_hidden = d_hidden, d_out = d_out) %>%
  fit(train_dl, epochs = 1000)
No, don't include the response variable; 'x' should hold only the predictors.
There are some interfaces where you pass the entire matrix and then specify which column is the target, but unless you are sure that is how the model expects its input, you don't want the response included among the predictors when you train the model, since the network can otherwise simply learn to copy it. In your case, since you already pass the target separately as self$y, drop that column from the x matrix and make sure d_in matches the reduced number of columns; see the sketch below.
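For example, the initialize method could look roughly like this (a sketch that keeps your column name "target" from df$target; adjust it to whatever your target column is actually called):
initialize = function(df) {
  vars <- df %>%
    select_if(is.factor) %>%
    colnames()
  df[vars] <- lapply(df[vars], as.numeric)
  # keep only the predictor columns in x
  x_df <- df[, setdiff(colnames(df), "target"), drop = FALSE]
  self$x <- torch_tensor(as.matrix(x_df))
  # the target goes into y only
  self$y <- torch_tensor(as.matrix(df$target))
}
With that change, d_in should equal ncol(x_df) rather than the full column count of df.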
If you haven't come across it, you could check out Deep Learning and Scientific Computing with R torch by Sigrid Keydana; it's freely available online. Chapter 13.2 contains an example dataloader using the Palmer Penguins data.
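In case it helps as a reference point, here is a minimal sketch along those lines (my own paraphrase, not the book's code, assuming the palmerpenguins package and body_mass_g as the target):
library(palmerpenguins)

penguins_df <- na.omit(as.data.frame(penguins))

penguins_ds <- dataset(
  name = "penguins_dataset",
  initialize = function(df) {
    # numeric predictors only; body_mass_g is the target and is excluded from x
    pred_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")
    self$x <- torch_tensor(as.matrix(df[, pred_cols]))
    self$y <- torch_tensor(as.matrix(df$body_mass_g))
  },
  .getitem = function(i) list(x = self$x[i, ], y = self$y[i]),
  .length = function() dim(self$x)[1]
)

penguins_dl <- dataloader(penguins_ds(penguins_df), batch_size = 32, shuffle = TRUE)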