Tags: docker, ubuntu, cuda, gpu, nvidia-docker

nvidia-docker2 general question about capabilities


Using regular Docker, I came to the conclusion that two different CUDA versions can't be combined in the following setup: using the local GPU (with CUDA 11, for example) from a Docker environment with an older OS version and an older CUDA version. My reasoning was that the container has to reach the local GPU through its own CUDA, and since the two versions aren't compatible, the whole thing is impossible.

Is this exactly the issue nvidia-docker2 is addressing?

Suppose my OS is Ubuntu 20+ with CUDA 11+, and I need to run code that requires CUDA 8, which is only compatible with Ubuntu 16, and I have other code that is compatible with CUDA 10 on Ubuntu 18.

From what I've seen (and correct me if I'm wrong), nvidia-docker2 would let me run the nvidia-smi command inside the container itself, so the container behaves as if ("thinks") the GPU is local to it. That way I could create one container with Ubuntu 16 and another with Ubuntu 18, and my GPU would happily work with whatever CUDA, cudatoolkit and cuDNN versions I install in the containers? I think it was also written that those components only need to be inside the containers, so it doesn't matter what CUDA version I have on my computer. Am I wrong?

And if that is the case, another question would be: with Docker and the cuda-container-toolkit, would I still be able to run the interpreter from the container, as I currently can using Docker and PyCharm? In other words, does it support that functionality in addition to letting me run different CUDA versions in different containers?

Or am I wrong, and hoping too optimistically that it is possible to debug different Docker environments with incompatible CUDA versions on the same local GPU, without installing different Ubuntu versions on the hard drive?

Or is that last option the only possible one (several Ubuntu installations on the same computer)? It sounds like the most reliable and easy solution anyway, but correct me where I am wrong.


Solution

  • Is this exactly the issue nvidia-docker2 is addressing?

    The primary issue has to do with the GPU driver. The GPU driver has components that run in kernel space and other components that run in user space. The implication of this is that for successful usage in docker, these components (user-space: inside the container, kernel space: outside the container) must match.

    That is a key function for the NVIDIA container toolkit/container runtime that augments docker: To make whatever is inside the container pertaining to the GPU driver match whatever is outside the container.
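
    As a rough illustration (the image tag is just an example; any available nvidia/cuda tag will do), a plain CUDA base image contains no driver at all, yet nvidia-smi works inside it because the toolkit injects the host driver's user-space libraries at run time:

        # With the NVIDIA container toolkit (Docker 19.03+):
        docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi

        # Equivalent with the older nvidia-docker2 runtime flag:
        docker run --rm --runtime=nvidia nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi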

    Other aspects of the CUDA toolkit (runtime libraries, nvcc, etc.) are separate, and regardless of whether you use the NVIDIA container toolkit or not, the code inside the container will need whatever it uses of that (e.g. runtime libraries, nvcc, etc.) to be present inside the container. The stuff outside the container for these items is irrelevant (unless, of course, you are providing it via a mount from outside to inside).
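
    For example (image tag illustrative), a -devel image carries its own nvcc and runtime libraries, so the container does not care which CUDA toolkit, if any, is installed on the host:

        # nvcc reports the CUDA version baked into the image,
        # regardless of what is installed on the base machine.
        docker run --rm --gpus all nvidia/cuda:10.2-devel-ubuntu18.04 nvcc --version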

    Apart from all that, CUDA itself has a dependency between the CUDA version of the toolkit, and the driver. In a nutshell, in ordinary usage, the CUDA version that is in the container must be a version that can be supported by the driver. Newer drivers support older toolkits. Older drivers do not support newer toolkits, unless you take special measures.
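
    A quick way to see what the installed driver supports: the nvidia-smi header on the base machine reports the driver version together with the highest CUDA version that driver can handle (numbers below are just an example):

        nvidia-smi
        # e.g.  Driver Version: 515.65.01    CUDA Version: 11.7
        # Containers built against CUDA 11.7 or older should run with this driver;
        # containers built against a newer CUDA generally will not, unless you use
        # the forward-compatibility packages (the "special measures" mentioned above).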


    To have the most flexibility in your setup, make sure you have the latest GPU driver installed in your base machine. And use the NVIDIA container toolkit. "Older" CUDA toolkits/docker containers should run fine in that setup.
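
    Applied to the scenario in the question, that could look roughly like this (tags are illustrative, and very old CUDA images may no longer be published on Docker Hub):

        # Ubuntu 20.04 host with a recent driver; two independent containers:
        docker run --rm --gpus all nvidia/cuda:8.0-devel-ubuntu16.04  nvcc --version
        docker run --rm --gpus all nvidia/cuda:10.0-devel-ubuntu18.04 nvcc --version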

    General recommendations from NVIDIA include:

    1. Don't install the driver in the container. Make sure there is a proper driver install in the base machine, and install/use only the CUDA toolkit components (without the driver) in the container itself.

    2. Use the NVIDIA container toolkit to make the container "harmonize" with the driver installed in the base machine (a minimal setup sketch follows this list).

    3. It's usually a good idea to have the latest GPU driver installed in the base machine. This should work with all containers that use CUDA.

    4. If you wish to provision the driver in the base machine using a containerized method (such as one might do in a Kubernetes cluster), there is the GPU operator for that. But this does not install the driver in the container, it installs it in the base machine, using a containerized delivery method.
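
    A minimal setup sketch for an Ubuntu base machine, assuming the driver is already installed and NVIDIA's package repository has been added as described in the official install guide (those repository steps are omitted here):

        sudo apt-get install -y nvidia-container-toolkit
        sudo nvidia-ctk runtime configure --runtime=docker   # registers the NVIDIA runtime with Docker
        sudo systemctl restart docker

        # Verify: the container sees the GPU through the host driver.
        docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi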