It worked fine until today, but now when I run darknet training on my computer, it runs for a little while and then crashes with a buffer overflow before completing even one epoch. I watched the GPU/CPU memory and both are okay (a small spike at the beginning, then normal until the crash). I'm using the AlexeyAB fork of darknet.
The error is:
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 106 Avg (IOU: 0.000000), count: 6, class_loss = 13507.688477, iou_loss = 1962709219740518995334266880.000000, total_loss = 1962709219740518995334266880.000000
total_bbox = 5305, rewritten_bbox = 0.245052 %
seen 64, trained: 266 K-images (4 Kilo-batches_64)
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Detection layer: 82 - type = 28
Detection layer: 94 - type = 28
Detection layer: 106 - type = 28
If error occurs - run training with flag: -dont_show
Resizing, random_coef = 1.40
608 x 608
try to allocate additional workspace_size = 1138.23 MB
CUDA allocate done!
Loaded: 0.000037 seconds
(next mAP calculation at 230 iterations)
131: 193631524938497389469106176.000000, 193631524938497389469106176.000000 avg loss, 0.001000 rate, 38.027819 seconds, 268288 images, -1.000000 hours left
*** buffer overflow detected ***: terminated
I have seen multiple reports of this error where the problem came from label files containing \r. I checked my files with Python and they were okay:
0 0.2846875 0.001953125 0.008125 0.00390625\n0 0.6734375 0.30859375 0.014375 0.1953125\n0 0.298125 0.626953125 0.01 0.16015625\n0 0.285625 0.912109375 0.0125 0.17578125\n
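For reference, this is roughly the check I ran. It is a minimal sketch; the "labels/*.txt" path is just a placeholder for wherever your label files live:

import glob

# Scan every YOLO label file for stray carriage returns (\r),
# which have been reported to trigger this kind of crash.
for path in glob.glob("labels/*.txt"):
    with open(path, "rb") as f:
        data = f.read()
    if b"\r" in data:
        print(f"{path}: contains \\r (Windows line endings)")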
What can I do to correct the buffer overflow?
Well, I found what caused the error. When I trained the model, I had it save weights every 10 epochs so that it would be easy to resume after a crash by simply passing the last saved weights as the starting point. Those last weights were causing the problem. When I switched to a different starting weights file, training worked fine again.
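In hindsight, a quick way to spot a bad checkpoint before resuming is to compare the sizes of the saved .weights files: a file that was cut off by a crash is usually noticeably smaller than the others. A minimal sketch, assuming the checkpoints are written to a "backup/" folder (adjust the path to whatever your .data file specifies):

import os, glob

# List the saved weights in darknet's backup folder and flag any file
# that is much smaller than the largest one, which often indicates a
# save that was interrupted mid-write.
sizes = {p: os.path.getsize(p) for p in glob.glob("backup/*.weights")}
if sizes:
    largest = max(sizes.values())
    for path, size in sorted(sizes.items()):
        flag = "  <-- suspiciously small, possibly truncated" if size < 0.9 * largest else ""
        print(f"{path}: {size / 1e6:.1f} MB{flag}")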