Tags: docker, gpu, windows-subsystem-for-linux, nvidia-docker

How can I run a container from nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04 using the `--gpus` option?


I'm trying to run a Docker container created from the image nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04, using Ubuntu 22.04 under WSL 2 (version 1.1.3.0) on Windows 11 with Docker Desktop 4.17.1. Running lsb_release -a confirms the Ubuntu version:

user@desktop:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

In Docker Desktop, the option "Use the WSL 2 based engine" is checked in Settings -> General, as is "Enable integration with my default WSL distro" in Settings -> Resources -> WSL integration. On the same page, "Enable integration with additional distros:" is switched on for Ubuntu-22.04.
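For completeness, the WSL version quoted above can be confirmed from a PowerShell or cmd prompt. This is just a sanity check; note that `wsl --version` is only available in the newer Store-distributed WSL builds:

```shell
# Confirm the WSL version that Docker Desktop's WSL 2 backend uses
# (requires the Microsoft Store release of WSL; older inbox builds
# do not support the --version flag).
wsl --version

# List installed distros and verify each one runs under WSL 2, not WSL 1.
wsl -l -v
```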

Running nvidia-smi from an Ubuntu terminal produces

user@desktop:~$ nvidia-smi
Tue Mar 21 22:43:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A200...  On   | 00000000:F3:00.0 Off |                  N/A |
| N/A   61C    P8     4W /  17W |     40MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        32      G   /Xwayland                       N/A      |
|    0   N/A  N/A        34      G   /Xwayland                       N/A      |
|    0   N/A  N/A     13020      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+

For what it's worth, running nvidia-smi.exe from a PowerShell terminal produces a similar but not identical result; the NVIDIA-SMI version shows as 528.49 in Windows instead of the 525.89.02 seen above in Ubuntu.

Running the container without --gpus produces the expected result right away, i.e., a working container without GPU functionality:

user@desktop:~$ docker run -it nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04

==========
== CUDA ==
==========

CUDA Version 12.0.1

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

root@80a1fd519f3a:/#

Multiple attempts to run the container with --gpus 0, --gpus 1, or --gpus all produced no output within one hour, after which I closed the terminal window; Ctrl+C did not stop execution.
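For reference, the invocations that hang look like the following. This is a sketch of what I ran; `--gpus all` is the form shown in the Docker documentation, and the quoted `device=` syntax is the documented way to select a specific GPU:

```shell
# Each of these hangs indefinitely instead of starting the container:
docker run --rm -it --gpus all nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
docker run --rm -it --gpus 1 nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04

# Selecting a single device explicitly behaves the same way:
docker run --rm -it --gpus '"device=0"' nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
```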

The outcomes above were also observed with Ubuntu 20.04 and with variants of the CUDA image, such as nvidia/cuda:11.6.0-devel-ubuntu20.04 and nvidia/cuda:12.1.0-ubuntu22.04. I also tried breaking the run into separate create and start steps; the issues described still occur at the start step.
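The create/start split was along these lines (a sketch; the container name `cuda-test` is arbitrary):

```shell
# Create the container without starting it; this step completes normally.
docker create --name cuda-test --gpus all -it nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04

# Start it attached with stdin open; this is the step that hangs.
docker start -ai cuda-test
```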

I have benefited from answers to this question, in particular the last answer, from August 2, 2020. Related questions such as pytorch cannot detect gpu in nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04 base image refer to issues after the container starts, but I do not get to that point.


Solution

  • It seems version 4.17.1 of Docker Desktop is broken. CUDA containers worked fine for me on versions <= 4.17.0, but after upgrading to 4.17.1 the container start-up process just hangs.
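As a quick smoke test on a working Docker Desktop version (<= 4.17.0, or a later release with the fix), the following should print the familiar nvidia-smi GPU table rather than hang. This assumes the same image as above and a correctly configured WSL 2 backend:

```shell
# Run nvidia-smi inside the CUDA container; on a working setup this
# exits after printing the driver/GPU table instead of hanging.
docker run --rm --gpus all nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04 nvidia-smi
```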