I am using Nvidia V100 with the following specs:
(pytorch) [s.1915438@cl1 aneurysm]$ srun nvidia-smi
Sun Jul 17 16:17:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:D8:00.0 Off | 0 |
| N/A 31C P0 25W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The Python, Pytorch and CUDA version is as follows:
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.0+cu113'
When I run a python file, containing a machine learning model, I get the following error.
(pytorch) [s.1915438@cl1 aneurysm]$ srun python aneurysm.py
terminate called after throwing an instance of 'std::runtime_error'
what(): the provided PTX was compiled with an unsupported toolchain.
srun: error: ccs2114: task 0: Aborted
Is it some kind of compatibility issue? Should I fallback to CUDA 10 .2 as the V100 is very old GPU?
Anyone using an old GPU from an HPC cluster is probably out of luck. In my case, I had Nvidia Driver 495 which is not very old. In fact, for CUDA 11.5 they recommend Nvidia Driver 470.
This is the official reply from Nvidia for a similar problem. They also recommend updating the driver. And most of the time HPC centres won't update the driver on personal requests.