cuda gpu nvidia slurm multi-gpu

Cannot find GPU devices on a data center node


I am new to finding my way around multi-node data centers, and I have run into the following problem.

First, I use the program from this answer to check for CUDA devices. I built it (I had some problems there, but that is a matter for another question), and the executable is called device_info8.

So I log in to my data center and, from the login node, run the executable:

[me@login01 test]$ ./device_info8
Number of devices: 1
Device Number: 0
  Device name: Tesla V100-PCIE-16GB
  Memory Clock Rate (MHz): 856
  Memory Bus Width (bits): 4096
  Peak Memory Bandwidth (GB/s): 898.0
  Total global memory (Gbytes) 15.8
  Shared memory per block (Kbytes) 48.0
  minor-major: 0-7
  Warp-size: 32
  Concurrent kernels: yes
  Concurrent computation/communication: yes

I don't have direct access to the node I want to test, so I request an interactive shell on it through Slurm:

[me@login01 test]$ srun -p partition1 --nodelist Node-11 --gres=gpu:all --pty -u bash -i
[me@Node-11 test]$

and now I do

[me@Node-11 test]$ ./device_info8
Number of devices: 0

However when I run nvidia-smi I can clearly see that I have 8 GPUs available!

[me@Node-11 test]$ nvidia-smi 
Tue Dec  3 18:16:04 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:2D:00.0 Off |                    0 |
| N/A   28C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:35:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:A9:00.0 Off |                    0 |
| N/A   26C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:B5:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Why is this happening and what am I overlooking? How can I make the GPUs available to the program?


Solution

  • The Slurm documentation does not mention --gres=gpu:all as a valid form, and when I try it on my system I get an error. Specify an actual number instead of all, then check the value of the CUDA_VISIBLE_DEVICES variable inside the job. It should not be empty. If it is, Slurm has not understood or honoured the request for GPUs.
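
The check described above can be sketched as a small shell helper. This is only an illustration: the function name check_gpu_env is made up here, and the count of 8 in the comment is an assumption taken from the nvidia-smi listing in the question. Note that nvidia-smi does not consult CUDA_VISIBLE_DEVICES, so it can list every GPU on the node even when the CUDA runtime reports none.

```shell
# Assumed way to start the session with an explicit GPU count (adjust to your site):
#   srun -p partition1 --nodelist Node-11 --gres=gpu:8 --pty -u bash -i

# Hypothetical helper: report which GPUs this shell has been given.
check_gpu_env() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        # Unset or empty: nothing in this shell's environment selects a GPU.
        echo "CUDA_VISIBLE_DEVICES is unset or empty"
        return 1
    fi
    # Non-empty: print the device list that CUDA programs will be restricted to.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
}
```

Run check_gpu_env at the Node-11 prompt before launching ./device_info8. If it reports an unset or empty variable, the GPU request most likely did not reach Slurm in a form it accepted.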