It worked fine until today, but now when I run darknet training on my computer, it runs for a little while and then crashes with a buffer overflow before completing even one epoch. I watched the GPU/CPU memory and both are okay (a small spike at the beginning, then normal until the crash). I'm using the AlexeyAB fork of darknet.
The error is:
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 106 Avg (IOU: 0.000000), count: 6, class_loss = 13507.688477, iou_loss = 1962709219740518995334266880.000000, total_loss = 1962709219740518995334266880.000000
total_bbox = 5305, rewritten_bbox = 0.245052 %
seen 64, trained: 266 K-images (4 Kilo-batches_64)
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Detection layer: 82 - type = 28
Detection layer: 94 - type = 28
Detection layer: 106 - type = 28
If error occurs - run training with flag: -dont_show
Resizing, random_coef = 1.40
608 x 608
try to allocate additional workspace_size = 1138.23 MB
CUDA allocate done!
Loaded: 0.000037 seconds
(next mAP calculation at 230 iterations)
131: 193631524938497389469106176.000000, 193631524938497389469106176.000000 avg loss, 0.001000 rate, 38.027819 seconds, 268288 images, -1.000000 hours left
*** buffer overflow detected ***: terminated
I have seen multiple reports of this error where the problem came from label files containing \r. I checked my files with Python and they were okay:
0 0.2846875 0.001953125 0.008125 0.00390625\n0 0.6734375 0.30859375 0.014375 0.1953125\n0 0.298125 0.626953125 0.01 0.16015625\n0 0.285625 0.912109375 0.0125 0.17578125\n
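For reference, this is roughly the check I ran. It is a minimal sketch; the "labels/*.txt" path is just a placeholder for wherever your label files live:

import glob

# Scan every YOLO label file for stray carriage returns (\r),
# which have been reported to trigger this kind of crash.
for path in glob.glob("labels/*.txt"):
    with open(path, "rb") as f:
        data = f.read()
    if b"\r" in data:
        print(f"{path}: contains \\r (Windows line endings)")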
What can I do to correct the buffer overflow?
Well, I found what caused the error. When I trained the model, I had it save weights every 10 epochs so that it would be easy to resume after a crash by simply passing the last saved weights as the starting point. Those last weights were causing the problem. When I switched to a different starting weights file, training worked fine again.
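In hindsight, a quick way to spot a bad checkpoint before resuming is to compare the sizes of the saved .weights files: a file that was cut off by a crash is usually noticeably smaller than the others. A minimal sketch, assuming the checkpoints are written to a "backup/" folder (adjust the path to whatever your .data file specifies):

import os, glob

# List the saved weights in darknet's backup folder and flag any file
# that is much smaller than the largest one, which often indicates a
# save that was interrupted mid-write.
sizes = {p: os.path.getsize(p) for p in glob.glob("backup/*.weights")}
if sizes:
    largest = max(sizes.values())
    for path, size in sorted(sizes.items()):
        flag = "  <-- suspiciously small, possibly truncated" if size < 0.9 * largest else ""
        print(f"{path}: {size / 1e6:.1f} MB{flag}")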