docker tensorflow ubuntu cuda

Using tensorflow with GPU on Docker on Ubuntu


I've been struggling with the problem described below for many days and would appreciate some help.
What I want to do is use TensorFlow with a GPU in Docker on Ubuntu.
My GPU is a GeForce GTX 1070, and my OS is Ubuntu 22.04.3 LTS.

I've installed Docker

$ docker --version

Docker version 26.1.1, build 4cf5afa

Before I started the following, I removed every nvidia or cuda module.

$ sudo apt-get -y --purge remove nvidia*
$ sudo apt-get -y --purge remove cuda*
$ sudo apt-get -y --purge remove cudnn*
$ sudo apt-get -y --purge remove libnvidia*
$ sudo apt-get -y --purge remove libcuda*
$ sudo apt-get -y --purge remove libcudnn*
$ sudo apt-get autoremove
$ sudo apt-get autoclean
$ sudo apt-get update
$ sudo rm -rf /usr/local/cuda*
$ pip uninstall tensorflow-gpu

Afterward, I installed the NVIDIA driver.

$ sudo apt install nvidia-driver-535

And nvidia-smi works fine.

$ nvidia-smi

Thu May 2 18:10:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
...

The next thing I did was install CUDA Toolkit 12.2 Update 2, following the instructions at the link below.

https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

I think CUDA Toolkit 12.2 Update 2 and driver 535.104.05 are compatible according to the info shown below.

https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Then I installed the NVIDIA Container Toolkit as follows.

$ curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

And next, I pulled a docker image.

$ docker pull tensorflow/tensorflow:latest-gpu

$ docker container run --rm --gpus all -it --name tf --mount type=bind,source=/home/(myname)/docker/tensorflow,target=/bindcont tensorflow/tensorflow:latest-gpu bash

In Docker container

root@a887e2a18124:/# python

Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

2024-05-02 09:32:46.211605: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-02 09:32:46.238888: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

>>> tf.config.list_physical_devices()

2024-05-02 09:32:55.124912: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2024-05-02 09:32:55.124931: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: 226046be5f09
2024-05-02 09:32:55.124934: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: 226046be5f09
2024-05-02 09:32:55.124963: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
2024-05-02 09:32:55.124975: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 535.104.5
2024-05-02 09:32:55.124977: E external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:251] kernel version 535.104.5 does not match DSO version 545.23.6 -- cannot find working devices in this configuration
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
-- End of Message --

It seems the driver version and the CUDA version are inconsistent, but I installed driver version 535, not 545, as shown above. I also removed everything before installing driver 535.
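A minimal sketch for pulling the two version numbers out of a log like the one above and checking whether they differ. The helper names (`parse_version`, `find_driver_mismatch`) are hypothetical, and the sketch assumes only the two message formats that appear in that log.

```python
import re

# Patterns for the two diagnostic messages seen in the log above.
LIBCUDA_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)*)")


def parse_version(text):
    """Turn '535.104.5' into (535, 104, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_driver_mismatch(log_text):
    """Return (kernel, libcuda) version tuples if both are present and differ, else None."""
    libcuda = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    if not libcuda or not kernel:
        return None
    kernel_v = parse_version(kernel.group(1))
    libcuda_v = parse_version(libcuda.group(1))
    return (kernel_v, libcuda_v) if kernel_v != libcuda_v else None


# Shortened copies of the two relevant log lines.
log = """\
cuda_diagnostics.cc:165] libcuda reported version is: 545.23.6
cuda_diagnostics.cc:169] kernel reported version is: 535.104.5
"""
print(find_driver_mismatch(log))  # ((535, 104, 5), (545, 23, 6))
```

Here the kernel module reports 535 while the libcuda loaded inside the container reports 545, which is the inconsistency the error message complains about.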

Could anyone suggest what is wrong and what I should do?


My problem has not been solved yet.

I removed everything and installed NVIDIA driver 545 instead.
I then followed the instructions at https://github.com/NVIDIA/nvidia-docker (deprecated) and
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

This time I did not install the CUDA Toolkit, only the NVIDIA Container Toolkit.

nvidia-smi reported:
NVIDIA-SMI 545.29.06
Driver Version 545.29.06
CUDA Version 12.3

Then I ran a container

$ docker container run --rm -it --name tf --mount type=bind,source=/home/susumu/docker/tensorflow,target=/bindcont tensorflow/tensorflow:2.15.0rc1-gpu bash

When I ran sample.py, I got

# python sample.py

2024-05-02 13:46:01.669548: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.689375: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-02 13:46:01.689395: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-02 13:46:01.690008: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-02 13:46:01.693281: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-02 13:46:01.693384: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-02 13:46:02.374705: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: UNKNOWN ERROR (34)

tf.Tensor(
[[1.]
 [1.]], shape=(2, 1), dtype=float32)

Here, sample.py is like below

# cat sample.py
import os
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

import tensorflow as tf
x = tf.ones(shape=(2, 1))

print(x)

As mhenning pointed out, my command lacked "--gpus all". So I added that option, and ran again.

$ docker container run --rm -it --gpus all --name tf --mount type=bind,source=/home/susumu/docker/tensorflow,target=/bindcont tensorflow/tensorflow:2.15.0rc1-gpu bash
root@112cb77313ca:/bindcont# python sample.py

2024-05-07 13:02:16.253609: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-07 13:02:16.273561: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-07 13:02:16.273586: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-07 13:02:16.274202: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-07 13:02:16.277520: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-07 13:02:16.277623: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-07 13:02:17.104731: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-07 13:02:17.107016: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

tf.Tensor(  
[[1.]  
 [1.]], shape=(2, 1), dtype=float32)  

Edited on May 13, 2024

As mhenning suggested, I pulled tensorflow/tensorflow:2.14.0-gpu and tried to run sample.py again.

root@e02085a11772:/bindcont# python sample.py

2024-05-13 12:42:46.130673: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-13 12:42:46.130698: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-13 12:42:46.130737: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-13 12:42:46.134588: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-13 12:42:46.980797: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:46.986228: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:46.986414: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:46.987792: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:46.987949: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:46.988049: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:47.092104: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:47.092243: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:47.092328: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-13 12:42:47.092398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7329 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:04:00.0, compute capability: 6.1

tf.Tensor(
[[1.]
 [1.]], shape=(2, 1), dtype=float32)

It seems my GPU works properly!

root@bce19cf9ec80:/# python -c "import tensorflow as tf;print(tf.sysconfig.get_build_info())"

2024-05-13 12:46:41.451323: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-05-13 12:46:41.471928: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-13 12:46:41.471950: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-13 12:46:41.471964: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-13 12:46:41.475902: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
OrderedDict([('cpu_compiler', '/usr/lib/llvm-16/bin/clang'), ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '11.8'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])

By the way, when I checked my BIOS setting, secure boot was off, so it had nothing to do with my trouble.


Solution

  • It seems that for a Docker container you don't need to install the CUDA Toolkit on the host system, only the NVIDIA driver. From the link:

    Docker is the easiest way to enable TensorFlow GPU support on Linux since only the NVIDIA® GPU driver is required on the host machine (the NVIDIA® CUDA® Toolkit does not need to be installed).

    This TF container installs CUDA 12.3 inside (you can see it in the image layers list here), and according to this table (Table 3 in the NVIDIA CUDA Toolkit Release Notes), CUDA 12.3 needs NVIDIA drivers >=545, which is the mismatch in the error stack. This is somewhat at odds with other tables, where the minimum requirement for CUDA 12 is just driver version >=525. From the link:

    CUDA Toolkit        Toolkit Driver Version  
                        Linux x86_64 Driver Version   Windows x86_64 Driver Version
    CUDA 12.4 Update 1  >=550.54.15                   >=551.78
    CUDA 12.4 GA        >=550.54.14                   >=551.61
    CUDA 12.3 Update 1  >=545.23.08                   >=546.12
    CUDA 12.3 GA        >=545.23.06 <- this one       >=545.84
    CUDA 12.2 Update 2  >=535.104.05                  >=537.13
    ...
    

    The easiest way would be to update your drivers to version >=545. It seems that for a 1070 the latest drivers are 550, so you should be fine version-wise.
    Alternatively, you could use the last Docker image with CUDA 11.8, which would be tensorflow:2.15.0rc1-gpu (this is wrong, see edit). All GPU images after this one seem to use CUDA 12.3, judging by the definitions in the image layers.
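The driver requirement from the table above can be checked mechanically. This is only a sketch: the dictionary holds just the Linux x86_64 rows quoted above, and the names `MIN_LINUX_DRIVER`, `parse_version` and `driver_supports` are made up for illustration.

```python
# Minimum Linux x86_64 driver versions, copied from the rows quoted above.
MIN_LINUX_DRIVER = {
    "12.4 Update 1": (550, 54, 15),
    "12.4 GA": (550, 54, 14),
    "12.3 Update 1": (545, 23, 8),
    "12.3 GA": (545, 23, 6),
    "12.2 Update 2": (535, 104, 5),
}


def parse_version(text):
    """Turn '535.104.05' into (535, 104, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_release):
    """True if the host driver meets the table's minimum for the given CUDA release."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_release]


print(driver_supports("535.104.05", "12.3 GA"))        # False
print(driver_supports("545.29.06", "12.3 GA"))         # True
print(driver_supports("535.104.05", "12.2 Update 2"))  # True
```

The first call corresponds to the original setup (driver 535 with an image that ships CUDA 12.3), the second to the setup after upgrading the driver to 545.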


    Edit: I was wrong about the tensorflow tag I recommended; it was in fact the worst version to recommend. It does install CUDA 11.8, which you can check with nvcc --version, but when you check the required CUDA version with tf.sysconfig.get_build_info() (or as a one-liner):

    python3 -c "import tensorflow as tf;print(tf.sysconfig.get_build_info())"
    

    you get the following with tensorflow:2.15.0rc1-gpu:

    OrderedDict([('cpu_compiler', '/usr/lib/llvm-17/bin/clang'), ('cuda_compute_capabilities', ['sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '12.2'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])

    Here you can see that TF expects CUDA 12.2, which is not what the image installs (it ships CUDA 11.8), so the GPU cannot be used in this container.

    I checked the other tags: for CUDA 11.8 you can use the tag tensorflow/tensorflow:2.14.0-gpu, and for CUDA 12.2 you can use tensorflow/tensorflow:2.15.0-gpu (the non-rc version) and above.
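The two checks described above (the CUDA version TensorFlow was built against versus the one nvcc reports) can be compared in a few lines. This is a rough sketch under assumptions: the helper names are hypothetical, the sample nvcc line is illustrative rather than copied from the container, and an exact string match on major.minor is a simplification.

```python
import re


def installed_cuda_from_nvcc(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_matches(build_info, nvcc_output):
    """True if the CUDA version TF was built for equals the one nvcc reports."""
    return build_info.get("cuda_version") == installed_cuda_from_nvcc(nvcc_output)


# Illustrative nvcc line (assumed format); the build-info values are the
# cuda_version entries shown in this thread.
nvcc_output = "Cuda compilation tools, release 11.8, V11.8.89"
print(cuda_matches({"cuda_version": "12.2"}, nvcc_output))  # False
print(cuda_matches({"cuda_version": "11.8"}, nvcc_output))  # True
```

Inside a container, `build_info` would come from `tf.sysconfig.get_build_info()` and `nvcc_output` from running `nvcc --version`; the first call mirrors the 2.15.0rc1-gpu case and the second the 2.14.0-gpu case.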