My data has 1751 sentences, but when training, a number appears under the epoch progress bars. Sometimes it is 1751, which makes sense since that is the number of sentences I have, but most of the time it is roughly 50% of that number (as shown in the figure below).
I tried to look in the documentation to understand whether this number should match my training set size, but I couldn't find an answer.
I am using Kaggle with a GPU backend. Does this mean the model is not actually training on all the data?
In short: no, it is training on all data.
First, let's look at some of the parameters:
num_of_train_epochs: 4
: your setting, meaning the whole dataset is passed through 4 times, which is why you see 4 progress bars in the output.
train_batch_size: 8
: the default setting, meaning that for each update of the weights, 8 records from your training data are used (out of a total of 1751).
So you have 1751 / 8 = 218.875 batches per epoch; the last, partial batch (7 records) still counts as one update, so the count rounds up to 219, which is the 219/219 you see in the output.
The 876 you see at the bottom simply means the trainer went through a total of 219 (batches per epoch) × 4 (number of epochs) = 876 batches/updates.
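To make the arithmetic concrete, here is a quick check in plain Python (just the math, no assumptions about your training library):

```python
import math

steps_per_epoch = math.ceil(1751 / 8)  # 219: the partial last batch still counts as one step
total_steps = steps_per_epoch * 4      # 876: the number shown at the bottom
print(steps_per_epoch, total_steps)    # -> 219 876
```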
One way to verify this is to change num_of_train_epochs to 1; you should then see 219 instead of 876.
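If you happen to be using a Hugging Face Trainer-style setup (an assumption on my part; your argument names num_of_train_epochs / train_batch_size suggest a similar API), the change would look roughly like this, with num_train_epochs and per_device_train_batch_size as the analogous parameters:

```python
from transformers import TrainingArguments

# Sketch only; adapt the argument names to the library you are actually using.
args = TrainingArguments(
    output_dir="out",               # placeholder output directory
    num_train_epochs=1,             # one full pass: the bar should stop at 219
    per_device_train_batch_size=8,  # same batch size as before
)
```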
Definitions of batch and epoch:
The batch size is the number of samples processed before the model is updated.
The number of epochs is the number of complete passes through the training dataset.
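To illustrate the two definitions, here is a minimal sketch of the epoch/batch loop, using a plain list as a stand-in for the 1751 sentences:

```python
data = list(range(1751))  # stand-in for the 1751 training sentences
batch_size = 8
num_epochs = 4

updates = 0
for epoch in range(num_epochs):                 # one epoch = one complete pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]  # last batch holds only 7 records
        updates += 1                            # one weight update per batch

print(updates)  # -> 876, matching 219 * 4
```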