python-3.x gpu tensorflow2.0 nvidia

tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error


I am trying to use a GPU with TensorFlow. My TensorFlow version is 2.4.1 and I am using CUDA version 11.2. Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX110       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    N/A /  N/A |    254MiB /  2004MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1151      G   /usr/lib/xorg/Xorg                 37MiB |
|    0   N/A  N/A      1654      G   /usr/lib/xorg/Xorg                136MiB |
|    0   N/A  N/A      1830      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      5443      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      5659      G   /usr/lib/firefox/firefox            0MiB |
+-----------------------------------------------------------------------------+

I am facing a strange issue. Previously, when I listed the physical devices with tf.config.list_physical_devices(), it identified one CPU and one GPU. After that I tried a simple matrix multiplication on the GPU. It failed with an error along the lines of "failed to synchronize cuda stream CUDA_LAUNCH_ERROR" (I forgot to note the exact error code). When I then tried the same thing from another terminal, it failed to recognise any GPU. This time, listing the physical devices produces this:

>>> tf.config.list_physical_devices()
2021-04-11 18:56:47.504776: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
2021-04-11 18:56:47.534404: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.39.0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
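
For reference, the same check can be run as a script instead of from the REPL. This is only a minimal sketch: the helper name visible_gpus is hypothetical, and it assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices accepts a device type. It returns None when TensorFlow cannot be imported, so it can also be run in an environment without it.

```python
from typing import List, Optional


def visible_gpus() -> Optional[List[str]]:
    """Return the names of the GPUs TensorFlow can see.

    Returns None if TensorFlow is not importable in this environment.
    (Hypothetical helper, not part of TensorFlow.)
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Passing "GPU" restricts the listing to GPU devices only.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("No GPU visible; look for a 'failed call to cuInit' line in the log")
    else:
        print("Visible GPUs:", gpus)
```

An empty list here corresponds to the CPU-only output shown above.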

My OS is Ubuntu 20.04, my Python version is 3.8.5, and, as mentioned above, I am using TensorFlow 2.4.1 with CUDA 11.2. I installed CUDA from these instructions. One additional piece of information: when I import tensorflow, it shows the following output:

>>> import tensorflow as tf
2021-04-11 18:56:07.716683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

What am I missing? Why is it failing to recognise the GPU even though it recognised it previously?


Solution

  • TL;DR: Disable Secure Boot before installing the NVIDIA driver.

    I had the exact same error, and I spent a lot of time trying to figure out whether I had installed the TensorFlow-related components incorrectly. After many hours of troubleshooting, I found that my NVIDIA driver was having problems because I never disabled Secure Boot in my BIOS when setting up Ubuntu 20.04. Here's what I suggest (I opted for using Docker with TensorFlow, which avoids having to install all the CUDA-related components). I hope it works for you!

    1. Disable Secure Boot in your BIOS.
    2. Make a fresh install of Ubuntu 20.04.
    3. Install Docker according to nvidia-container-toolkit's page.
    curl https://get.docker.com | sh \
      && sudo systemctl --now enable docker
    
    4. Install nvidia-container-toolkit from the same page.
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
       && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
       && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    
    sudo apt-get update
    
    sudo apt-get install -y nvidia-docker2
    
    sudo systemctl restart docker
    
    5. Test to make sure that's working with:
    sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
    
    6. Finally, use TensorFlow with Docker with GPU support!
    docker run --gpus all -u $(id -u):$(id -g) -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --ip=0.0.0.0
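
Before reinstalling anything, it may be worth confirming whether Secure Boot is actually enabled. The sketch below is one way to check from Python, not the only one. It assumes a UEFI system with efivarfs mounted at /sys/firmware/efi/efivars. The function names are hypothetical. The layout it relies on (four attribute bytes followed by a one-byte value, where 1 means enabled) is how efivarfs exposes the SecureBoot variable.

```python
from pathlib import Path
from typing import Optional

# SecureBoot variable under the EFI global-variable GUID, as exposed by efivarfs.
SECURE_BOOT_VAR = Path(
    "/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c"
)


def parse_secure_boot(raw: bytes) -> Optional[bool]:
    """Interpret the raw efivarfs contents of the SecureBoot variable.

    efivarfs prefixes the value with 4 attribute bytes; the 5th byte is the
    state (1 = enabled, 0 = disabled). Returns None if the data is too short.
    """
    if len(raw) < 5:
        return None
    return raw[4] == 1


def secure_boot_enabled(path: Path = SECURE_BOOT_VAR) -> Optional[bool]:
    """Return True/False for the Secure Boot state, or None if it cannot be read."""
    try:
        return parse_secure_boot(path.read_bytes())
    except OSError:
        # Legacy BIOS boot, efivarfs not mounted, or the file is not readable.
        return None


if __name__ == "__main__":
    labels = {True: "enabled", False: "disabled", None: "unknown"}
    print("Secure Boot:", labels[secure_boot_enabled()])
```

If this reports "enabled" and the NVIDIA driver was installed without enrolling a signing key, that matches the situation described in this answer. The mokutil --sb-state command gives the same information from a shell.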