python, tensorflow, object-detection, object-detection-api, efficientnet

Why is training loss oscillating up and down?


I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset, I notice that the total loss is jumping up and down, for example from 0.5 up to 2.0 a few steps later, and then back down to 0.75:

[Tensorboard screenshot]

So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during the training, it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the 2nd half of the training).

[Tensorboard screenshot, smoothed]

Also, when I smooth the curves in Tensorboard, as in the 2nd image above, one can see the total loss going down, so the direction is correct, even though it is still at quite a high value. I would be interested to know why I can't achieve better results with my training set, but I guess that is another question. First, I would really like to know why the total loss goes up and down so much throughout the training. Any ideas?

PS: The pipeline.config file for my training can be found here.


Solution

  • Your config states that your batch size is 2. This is tiny and will cause a very volatile loss.

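    For reference, in the TF2 Object Detection API the batch size is the batch_size field inside the train_config block of pipeline.config, roughly like this (values here are illustrative):

    train_config {
        batch_size: 2   # raise this as far as your hardware allows
        ...
    }
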
    Try increasing your batch size substantially; a value of 256 or 512 would be a good starting point. If you are constrained by memory, you can simulate a larger batch via gradient accumulation.


    Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.

    Typically, a training loop would look like this (I'm using PyTorch-like syntax for illustrative purposes):

    # Standard loop: one optimizer update per mini-batch
    for model_inputs, truths in iter_batches():
        predictions = model(model_inputs)      # forward pass
        loss = get_loss(predictions, truths)   # loss for this mini-batch
        loss.backward()                        # backward pass: compute gradients
        optimizer.step()                       # apply the parameter update
        optimizer.zero_grad()                  # clear gradients for the next batch
    

    With gradient accumulation, you put several mini-batches through and only then update the model. This simulates a larger batch size without requiring the memory to push the whole large batch through at once:

    accumulations = 10

    for i, (model_inputs, truths) in enumerate(iter_batches()):
        predictions = model(model_inputs)
        # Scale the loss so the summed gradients match the average over
        # the effective (accumulations x batch size) batch.
        loss = get_loss(predictions, truths) / accumulations
        loss.backward()                       # gradients accumulate in .grad
        if (i + 1) % accumulations == 0:      # after every `accumulations`-th mini-batch...
            optimizer.step()                  # ...apply one combined update
            optimizer.zero_grad()             # ...and reset the accumulated gradients
    
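    Two details to note: dividing the loss by accumulations keeps each combined update comparable in size to a single update on the larger batch, and the (i + 1) % accumulations check fires once per accumulations mini-batches. If the number of mini-batches is not an exact multiple of accumulations, the last few gradients never produce an update, which is usually acceptable.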

    Reading