As I understand it, the idea of a mini-batch size is equivalent to fitting the model to only a portion of the training data at each step (one epoch consists of many steps, depending on the batch size) in order to avoid overfitting.
So if I use the full batch (all data) and randomly remove n observations when calculating the loss at each epoch, is this equivalent to the idea of a mini-batch size?
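To make the comparison concrete, here is a rough NumPy sketch of the two sampling schemes I have in mind (all sizes are made-up examples):

```python
import numpy as np

N = 10_000          # total number of training observations (made up)
batch_size = 64     # a typical mini-batch size (made up)
n = 100             # observations I would randomly drop per epoch (made up)

rng = np.random.default_rng(0)

# (a) Mini-batch training: each step uses a small random subset of size 64.
minibatch_idx = rng.choice(N, size=batch_size, replace=False)

# (b) My proposal: each epoch uses all data except n random observations,
#     i.e. a single "batch" of size N - n.
keep_idx = np.setdiff1d(np.arange(N), rng.choice(N, size=n, replace=False))

print(len(minibatch_idx), len(keep_idx))   # 64 vs 9900
```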
I am using an LSTM neural network trained on time series data. Here, let's assume that I have unlimited storage and computational capacity.
Thanks for any comments
Usually a full batch does not fit on your GPU, whereas a mini-batch does. Looking at the other extreme, a mini-batch size of 1, it is obvious that the gradient will be very noisy, since it depends on a single input. A noisy gradient will cause the optimizer to follow a very wiggly path through the search space, which is not efficient.
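As a toy illustration of that noise (a linear model with a squared-error loss, not your LSTM; all numbers are made up), you can estimate the gradient from batches of different sizes and compare them to the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)

w = np.zeros(d)  # current parameters

def grad(idx):
    """Gradient of the mean squared error over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(N))  # "true" full-batch gradient

for b in (1, 64, N):
    # average squared deviation from the full-batch gradient over 200 draws
    dev = np.mean([np.sum((grad(rng.choice(N, b, replace=False)) - full) ** 2)
                   for _ in range(200)])
    print(f"batch size {b:>5}: mean squared deviation {dev:.4f}")
```

The deviation shrinks as the batch grows: batch size 1 gives a very scattered estimate, while the full batch gives the exact gradient.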
I can't follow your argument here:
So if I use the full batch (all data) and randomly remove n observations when calculating the loss at each epoch, is this equivalent to the idea of a mini-batch size?
This is what I understood: your scheme trains on a single batch of size all − n per epoch. Unless n is a large fraction of the data, that is still essentially a full batch, so it does not give you the noisy, stochastic gradients that make mini-batches useful in the first place.

You should choose a batch size that is typical for your problem (maybe check papers on your topic). Too big a batch size will make the gradient smooth, but stochasticity has benefits, too, as it helps you to escape local minima and makes your network more robust to unseen cases (generalization).
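For reference, a plain mini-batch training loop for a sequence model typically looks like the sketch below (PyTorch-style; all shapes, the batch size of 64 and the hyperparameters are placeholders, not recommendations). The point is that each optimizer step sees only one small random batch, which is what gives you the stochasticity:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

seq_len, n_features, hidden = 30, 4, 32
X = torch.randn(2_000, seq_len, n_features)   # fake time-series windows
y = torch.randn(2_000, 1)                     # fake targets

loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

class LSTMRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, seq_len, hidden)
        return self.head(out[:, -1])   # predict from the last time step

model = LSTMRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:              # each step sees only one mini-batch
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
```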