I am using the TF2 Object Detection API (from the TensorFlow models research repository) with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss jumps up and down - for example from 0.5 to 2.0 within a few steps, and then back to 0.75.
So, all in all, this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during training; it goes down to a really small value of 1e-15, so I don't see how that can be the problem (at least in the second half of the training).
Also, when I smooth the curves in TensorBoard, as in the second image above, the total loss is going down, so the direction is correct, even though it is still at quite a high value. Why I can't achieve better results with my training set is probably a separate question; first of all, I would really like to know why the total loss goes up and down so much throughout the training. Any ideas?
PS: The pipeline.config file for my training can be found here.
Your config states that your batch size is 2. This is tiny and will cause a very volatile loss. Try increasing your batch size substantially; aim for a value such as 256 or 512.
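For reference, the batch size is set in the train_config block of your pipeline.config; a minimal sketch of that block (the surrounding fields are omitted and depend on your setup) looks like this:
train_config {
  batch_size: 2    # raise this as far as your GPU memory allows
  # ... optimizer, data augmentation, etc. unchanged
}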
If you are constrained by memory, try increasing the effective batch size via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would look like this (I'm using PyTorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)      # forward pass
    loss = get_loss(predictions, truths)
    loss.backward()                        # compute gradients
    optimizer.step()                       # update the parameters after every batch
    optimizer.zero_grad()                  # clear the gradients for the next batch
With gradient accumulation, you put several batches through and only then update the model. This simulates a larger batch size without requiring the memory to actually push a large batch through all at once:
accumulations = 10

for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths) / accumulations   # scale so the summed gradient matches a large-batch average
    loss.backward()                                        # gradients accumulate in the .grad buffers
    if (i + 1) % accumulations == 0:                       # update only once every `accumulations` batches
        optimizer.step()
        optimizer.zero_grad()
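One caveat, sketched here as an addition to the loop above (not part of the original answer): if the number of batches is not a multiple of accumulations, the last few gradients are accumulated but never applied, so you can flush them once the loop ends:
# Hypothetical post-loop flush: apply any gradients accumulated after the
# last full group of `accumulations` batches.
if (i + 1) % accumulations != 0:
    optimizer.step()
    optimizer.zero_grad()
Also note that with your current settings this only gives an effective batch size of 2 × 10 = 20; you would need a larger per-step batch and/or more accumulation steps to get near the 256 or 512 suggested above.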