As I understand it, the idea of a mini-batch size is equivalent to fitting the model to only a portion of the training data at each step (one epoch consists of many steps, depending on the batch size) in order to avoid overfitting.
So if I use the full batch (all data) and randomly remove n observations when calculating the loss at each epoch, is this equivalent to the idea of a mini-batch size?
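To make the comparison concrete, here is a rough NumPy sketch of the two sampling schemes I have in mind (all sizes are made-up examples):

```python
import numpy as np

N = 10_000          # total number of training observations (made up)
batch_size = 64     # a typical mini-batch size (made up)
n = 100             # observations I would randomly drop per epoch (made up)

rng = np.random.default_rng(0)

# (a) Mini-batch training: each step uses a small random subset of size 64.
minibatch_idx = rng.choice(N, size=batch_size, replace=False)

# (b) My proposal: each epoch uses all data except n random observations,
#     i.e. a single "batch" of size N - n.
keep_idx = np.setdiff1d(np.arange(N), rng.choice(N, size=n, replace=False))

print(len(minibatch_idx), len(keep_idx))   # 64 vs 9900
```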
I am using an LSTM neural network trained on time series data. Here, let's assume that I have unlimited storage and computational capacity.
Thanks for any comments
Usually a full batch does not fit on your GPU, whereas a mini-batch does. Looking at the other extreme, a mini-batch size of 1, it is obvious that the gradient will be very noisy, since it depends on a single input. A noisy gradient will cause the optimizer to follow a very wiggly path through the search space, which is not efficient.
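As a toy illustration of that noise (a linear model with a squared-error loss, not your LSTM; all numbers are made up), you can estimate the gradient from batches of different sizes and compare them to the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

w = np.zeros(d)  # current parameters

def grad(idx):
    """Gradient of the mean squared error over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(N))  # "true" full-batch gradient

for b in (1, 64, N):
    # average squared deviation from the full-batch gradient over 200 draws
    dev = np.mean([np.sum((grad(rng.choice(N, b, replace=False)) - full) ** 2)
                   for _ in range(200)])
    print(f"batch size {b:>5}: mean squared deviation {dev:.4f}")
```

The deviation shrinks as the batch grows: batch size 1 gives a very scattered estimate, while the full batch gives the exact gradient.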
I can't follow your argument here:
So if I use the full batch (all data) and randomly remove n observations when calculating the loss at each epoch, is this equivalent to the idea of a mini-batch size?
This is what I understood: your scheme trains on a single batch of size all − n per epoch. Unless n is a large fraction of the data, that is still essentially a full batch, so it does not give you the noisy, stochastic gradients that make mini-batches useful in the first place.

You should choose a batch size that is typical for your problem (maybe check papers on your topic). Too big a batch size will make the gradient smooth, but stochasticity has benefits, too, as it helps you to escape local minima and makes your network more robust to unseen cases (generalization).
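For reference, a plain mini-batch training loop for a sequence model typically looks like the sketch below (PyTorch-style; all shapes, the batch size of 64 and the hyperparameters are placeholders, not recommendations). The point is that each optimizer step sees only one small random batch, which is what gives you the stochasticity:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

seq_len, n_features, hidden = 30, 4, 32
X = torch.randn(2_000, seq_len, n_features)   # fake time-series windows
y = torch.randn(2_000, 1)                     # fake targets

loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

class LSTMRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, seq_len, hidden)
        return self.head(out[:, -1])   # predict from the last time step

model = LSTMRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:              # each step sees only one mini-batch
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
```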