darknet multi-gpu yolov4

cuDNN Error: CUDNN_STATUS_BAD_PARAM. What does this mean, and why is it occurring?


I am trying to train an object detection model with YOLOv4 on multiple GPUs (Tesla T4), but somewhere between 1000 and 2000 iterations training fails with the error below.


(next mAP calculation at 1214 iterations)
1216: 1.498149, 1.476265 avg loss, 0.010440 rate, 2.871675 seconds, 311296 images, 4.278861 hours left
4Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #543
cuDNN Error: CUDNN_STATUS_BAD_PARAM: No such file or directory
backtrace (13 entries)
1/13: ./darknet(log_backtrace+0x38) [0x55e9c42cfc18]
2/13: ./darknet(error+0x3d) [0x55e9c42cfcfd]
3/13: ./darknet(+0x834b0) [0x55e9c42d24b0]
4/13: ./darknet(cudnn_check_error_extended+0x7c) [0x55e9c42d2a9c]
5/13: ./darknet(forward_convolutional_layer_gpu+0x2c2) [0x55e9c43b0d12]
6/13: ./darknet(forward_network_gpu+0x101) [0x55e9c43c4d41]
7/13: ./darknet(network_predict_gpu+0x131) [0x55e9c43c7711]
8/13: ./darknet(validate_detector_map+0xa2e) [0x55e9c435afce]
9/13: ./darknet(train_detector+0x17f8) [0x55e9c435db48]
10/13: ./darknet(run_detector+0xa04) [0x55e9c4361eb4]
11/13: ./darknet(main+0x341) [0x55e9c428c311]
12/13: /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f5165d4d083]
13/13: ./darknet(_start+0x2e) [0x55e9c428e58e]
Resizing to initial size: 608 x 608  try to allocate additional workspace_size = 70.08 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 70.08 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 70.08 MB
 CUDA allocate done!
 try to allocate additional workspace_size = 70.08 MB
 CUDA allocate done!

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28
 Detection layer: 37 - type = 28
 Detection layer: 44 - type = 28

 cuDNN status Error in: file: ./src/convolutional_kernels.cu function: forward_convolutional_layer_gpu() line: 543

 cuDNN Error: CUDNN_STATUS_BAD_PARAM

Here are my config file details. Training on a single GPU works fine with the same config file.

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=8
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.00261
burn_in=1000
max_batches = 10000
policy=steps
steps=8000,9000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[route]
layers = -1,-2

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

##################################

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear



[yolo]
mask = 6,7,8
anchors =  6, 14,  14, 33,  34, 63,  55,134, 108, 77,  98,162, 127,277, 280,179, 274,405
classes=2
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 23

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 3,4,5
anchors =  6, 14,  14, 33,  34, 63,  55,134, 108, 77,  98,162, 127,277, 280,179, 274,405
classes=2
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6


[route]
layers = -3

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 15

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear

[yolo]
mask = 0,1,2
anchors = 6, 14,  14, 33,  34, 63,  55,134, 108, 77,  98,162, 127,277, 280,179, 274,405
classes=2
num=9
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
iou_loss=ciou
ignore_thresh = .7
truth_thresh = 1
random=1
resize=1.5
nms_kind=greedynms
beta_nms=0.6

I am also providing additional details in case they help: CUDA-version: 11040 (12020), cuDNN: 8.2.4, GPU count: 4, OpenCV version: 4.2.0, 0 : compute_capability = 750, cudnn_half = 0, GPU: Tesla T4

I saw that the YOLOv4 Git README recommends training on a single GPU for the first 1000 iterations and then continuing from those weights on multiple GPUs. I had no lead on why that would matter, so I have not tried it, but I am curious why they recommend it. Also, what is going wrong here?
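For reference, the same multi-GPU section of the old AlexeyAB README also suggests, as far as I can tell, that for some datasets it helps to divide learning_rate by the number of GPUs and multiply burn_in by the same factor. Below is a minimal sketch of that arithmetic, applied to the learning_rate and burn_in values from the cfg above. The function name `scale_for_multi_gpu` is hypothetical, and this adjustment concerns training stability; it is not claimed to fix the cuDNN error.

```python
from typing import Tuple


def scale_for_multi_gpu(learning_rate: float, burn_in: int, gpus: int) -> Tuple[float, int]:
    """Divide the learning rate by the GPU count and stretch burn_in by it.

    Hypothetical helper: it only mirrors the README's rule of thumb and is
    not part of Darknet itself.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return learning_rate / gpus, burn_in * gpus


# Values taken from the [net] section above, with the 4 GPUs from the question.
lr, burn_in = scale_for_multi_gpu(learning_rate=0.00261, burn_in=1000, gpus=4)
print(f"learning_rate={lr:.6g}")
print(f"burn_in={burn_in}")
```

The resulting values would go into the `[net]` section of the cfg used for the multi-GPU run.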


Solution

  • You are using an old repo that was abandoned several years ago. This bug has since been fixed in the new Darknet/YOLO repo. You can do any of the following:

    1. Pin CUDA and cuDNN to an older 11.x version.
    2. Switch to the new Darknet/YOLO repo, which contains the fix (https://github.com/hank-ai/darknet#table-of-contents).
    3. Remove cuDNN and rebuild the old Darknet repo.
    4. Backport the fixes from the new repo to the old source code you are using (which, judging by the output, is the old AlexeyAB version).
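
Option 3 can be sketched as follows. This assumes the old repo is built with its Makefile and that the Makefile has a top-level `CUDNN=1` switch next to `GPU=` and `CUDNN_HALF=`, which is how I remember the AlexeyAB Makefile. The helper name `disable_cudnn` is hypothetical. It only rewrites the Makefile text; the rebuild itself (`make clean` followed by `make`) still has to be run in the repo directory.

```python
import re


def disable_cudnn(makefile_text: str) -> str:
    """Return the Makefile text with the top-level CUDNN switch set to 0.

    Hypothetical helper. The pattern matches only a line that is exactly
    `CUDNN=<digits>`, so `CUDNN_HALF=...` is left untouched.
    """
    return re.sub(r"^CUDNN=\d+[ \t]*$", "CUDNN=0", makefile_text, flags=re.MULTILINE)


# Sample header resembling the switches at the top of the old Makefile (assumed layout).
sample = "GPU=1\nCUDNN=1\nCUDNN_HALF=0\nOPENCV=1\n"
patched = disable_cudnn(sample)
print(patched)
```

To apply it for real, read the repo's `Makefile`, pass its contents through `disable_cudnn`, write the result back, and rebuild. Training without cuDNN will likely be slower, so this is a workaround rather than a fix.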