Tags: python, pytorch, build, cuda

PyTorch problem with a specific version of CUDA


Background

I need to test this AI model on the following CUDA server:

https://github.com/sicxu/Deep3DFaceRecon_pytorch

$ nvidia-smi 
Tue Jun 18 18:28:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:41:00.0 Off |                  N/A |
|  0%   40C    P8              13W / 170W |     39MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1078      G   /usr/lib/xorg/Xorg                           16MiB |
|    0   N/A  N/A      1407      G   /usr/bin/gnome-shell                          3MiB |
+---------------------------------------------------------------------------------------+

But I'm receiving this warning while testing:

/home/arisa/.conda/envs/deep3d_pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py:125: UserWarning: NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

I receive this error after the above warning:

RuntimeError: CUDA error: no kernel image is available for execution on the device
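A quick way to see why this error occurs is to compare the GPU's compute capability with the architectures compiled into the installed wheel. A minimal diagnostic sketch, assuming a working `torch` install (note that `torch.cuda.get_arch_list()` is only available in more recent PyTorch releases, not in 1.6.0):

```python
# Diagnostic sketch: list the architectures the installed wheel ships kernels
# for, and the capability the local GPU actually needs.
import torch

# Architectures baked into the wheel, e.g. ['sm_37', ..., 'sm_75'];
# empty on a CPU-only build.
print("Wheel supports:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU needs: sm_{major}{minor}")  # an RTX 3060 reports sm_86
```

If the GPU's `sm_XY` does not appear in the wheel's list, any kernel launch fails with exactly the "no kernel image is available" error above.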

Note: CUDA 12.2 vs 12.3

I was able to test the same AI model on Google Colab with CUDA 12.2 without any problem, so I'm not sure why the server with CUDA 12.3 is causing trouble.

[Screenshot: Google Colab nvidia-smi output showing CUDA Version 12.2]

Why?

Why does CUDA 12.2 work fine while CUDA 12.3 throws warnings and errors?

Building from source

So I thought I would build PyTorch 1.6.0 — the version required by the AI model — against CUDA 12.3. Is it possible to build PyTorch 1.6.0 from source with CUDA 12.3 without patching the source code?

https://github.com/pytorch/pytorch/releases/tag/v1.6.0
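One detail worth checking before building anything: the "CUDA Version" that nvidia-smi prints is the driver's supported CUDA version, not the toolkit the PyTorch wheel was compiled with. A short sketch, assuming `torch` is importable, to see what the installed wheel was actually built against:

```python
# Report the CUDA toolkit the installed PyTorch wheel was compiled against.
# This is independent of the driver's CUDA version shown by nvidia-smi.
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)  # None on a CPU-only build
```

For a stock PyTorch 1.6.0 wheel this typically reports a CUDA 10.x toolkit, which predates Ampere (sm_86) support, so the warning is about the wheel's build configuration rather than the server's CUDA 12.3 driver.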


Solution

  • As commented by @talonmies:

    This has nothing to do with “CUDA versions”, which is irrelevant. The problem is clearly described in the error message - the Colab case is using a Tesla T4 (compute capability 7.5) for which the Pytorch build you have includes binary support, whereas the other GPU is a compute capability 8.6 device and there is no binary support in the same Pytorch build. There is nothing to do except get a build of the PyTorch version you want to use with compute 8.6 binary support included, if such a thing exists

    Plan

    I'm going to find the earliest PyTorch version that includes compute capability 8.6 binary support and then test whether the AI model runs with it...
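    For each candidate version, a small smoke test can confirm both that the wheel ships sm_86 binaries and that a kernel actually launches on the GPU. A sketch, assuming the candidate `torch` build is installed:

    ```python
    # Smoke test: try launching a tiny CUDA kernel. A wheel without binaries
    # for this GPU's compute capability raises the same "no kernel image is
    # available" RuntimeError seen above.
    import torch

    def gpu_is_usable() -> bool:
        if not torch.cuda.is_available():
            return False
        try:
            x = torch.ones(8, device="cuda")
            return (x + x).sum().item() == 16.0
        except RuntimeError:
            return False

    print("GPU usable with this build:", gpu_is_usable())
    ```

    Running this after each candidate install avoids loading the full AI model just to discover the build lacks sm_86 kernels.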