python, pytorch, pytorch-dataloader

Using DataLoader for efficient model prediction


I'm trying to understand the role/utility of batch_size in torch beyond model training. I already have a trained model, for which batch_size was optimized as a hyperparameter, and I want to use that model to make predictions on new data. I pre-process the new data in the same format that was used for training. But instead of reusing the optimized batch_size and looping through the batches, I set batch_size equal to the size of the data set (with shuffle=False) and pass the entire data set through the model for prediction at once.

I was wondering about the correctness of such an approach. I don't have a lot of experience with torch, and I couldn't find much information about the most efficient way to use a trained model for making predictions.

Here is a simplified version of the predict method I implemented in the model class; it illustrates the use of DataLoader I'm referring to. I must say I noticed a significant speed-up with this approach over looping through the data.

def predict(self, X):
    # Load the whole data set as a single batch (no shuffling needed for inference)
    X_loader = DataLoader(
        X,
        batch_size=X.shape[0],
        shuffle=False,
    )

    # There is only one batch, so take it directly from the iterator
    batch = next(iter(X_loader))
    with torch.no_grad():  # no gradients needed for prediction
        predictions = self(batch)

    return predictions

Thank you

EDIT

My question is whether it is correct to use the DataLoader in this way for making predictions, or whether I have to use the batch_size value that was optimized during training. In other words, can using a different batch_size for prediction than the one used for training affect the results?


Solution

  • Short answer: The batch size during inference (i.e. for making predictions on new samples with a trained model), under all reasonable circumstances, should be free to choose independently of the batch size during training.

    Long answer

    The role of the batches is different during model training and inference, so usually different batch sizes are applied in the two settings.

    Batch size during training

    During training, the main role of the batches is to provide an estimate of the complete data distribution, so that the gradient applied when updating the model parameters via backpropagation is a good approximation of the true gradient, and updating the parameters with it indeed leads towards an optimum. As a consequence, the batch size during training is often chosen as large as possible (i.e. limited by the hardware used for model training). Other considerations and parameters, such as the learning rate and the optimization approach, also play a role there, though, so several findings regarding the optimal choice of batch size have been published (see e.g. this and this paper – I am pretty sure that there are also more recent ones).
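
    To make that concrete, here is a minimal, hypothetical training sketch (toy linear model, random data, plain SGD – none of this comes from the original question) that shows where the training batch size enters: each optimizer step uses the gradient estimated from one batch only.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy data and model, purely for illustration
    X_train = torch.randn(1024, 10)
    y_train = torch.randn(1024, 1)
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    # The training batch size is a tuned hyperparameter here
    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()   # gradient estimated from this batch only
        optimizer.step()  # parameter update based on that estimate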

    Batch size during inference

    During inference, the only benefit of batching samples rather than processing them individually is increased parallelism (the same calculations are applied to all samples) and thus, potentially, increased efficiency. The batch size itself is usually determined by other factors than during training: on the one hand, since no gradients have to be calculated any more, larger batch sizes are usually possible if the same hardware that was used for training is still being used; on the other hand, inference often happens on less potent hardware (think e.g. of mobile devices) or with different latency requirements (think e.g. of real-time applications), so a smaller batch size might be chosen instead. What is important in either scenario: once a trained model is applied, it should treat different samples independently, so the result for each individual sample should remain the same, no matter which and how many other samples are in its batch.
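
    As an illustration, here is a minimal inference sketch (assuming a trained `model` and a tensor of new samples `X_new`, both hypothetical names): the batch size only controls how many samples are processed in parallel, and the per-sample predictions do not depend on it.

    import torch
    from torch.utils.data import DataLoader

    model.eval()  # disable dropout, use running batch-norm statistics
    loader = DataLoader(X_new, batch_size=256, shuffle=False)  # any size that fits in memory

    with torch.no_grad():
        predictions = torch.cat([model(batch) for batch in loader])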

    Of course, technically, everyone is free to design and implement a model that behaves differently, but I would not know what the motivation for that would be (which is what I am referring to with "under all reasonable circumstances" in the short answer above). Crucially, what I do not mean here are the individual inputs to something like video models or language models, where the sequence of input frames/tokens indeed plays a role: there, a "sample" is more than an individual video frame or language token – and the batch size should still not have an influence on the result during inference.

    In any case, a sanity check would be to apply the same model with different batch sizes to the same samples for inference and compare the outputs: if, for the same input, the outputs are identical (within the limits of floating-point math), the batch size is free to choose.
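
    Such a check could look roughly like this (again assuming a hypothetical trained `model` and a tensor `X_new`):

    import torch
    from torch.utils.data import DataLoader

    model.eval()
    with torch.no_grad():
        # Same inputs, two different batch sizes
        out_full = torch.cat([model(b) for b in DataLoader(X_new, batch_size=len(X_new))])
        out_small = torch.cat([model(b) for b in DataLoader(X_new, batch_size=32)])

    # Outputs should agree up to floating-point tolerance
    assert torch.allclose(out_full, out_small, atol=1e-6)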