c++ cuda tesla

cudaSetValidDevices doesn't seem to work as expected


I'm using an NVIDIA machine which has 2 Tesla V100 GPUs. I used the cudaSetValidDevices API to set only 1 valid device, which is device 1. After that, if I try to set device 0, the API still seems to work without returning an error. Is this expected behavior? I was thinking both cudaSetDevice and cudaGetDeviceProperties for device 0 should return an error code like cudaErrorInvalidDevice. I compiled the program with the 12.9 nvcc compiler.

#include <iostream>
#include <cuda_runtime.h>

int main() {
  int valid_devices[] = {1};  // Allow only device 1
  cudaError_t err = cudaSetValidDevices(valid_devices, 1);
  std::cout << "Set valid devices: " << cudaGetErrorString(err) << "\n";
  
  int count = 0;
  cudaGetDeviceCount(&count);
  std::cout << "Device count: " << count << "\n";


  // Try to use a disallowed device
  err = cudaSetDevice(0);
  std::cout << "Set device 0: " << cudaGetErrorString(err) << "\n";  

  cudaDeviceProp props;
  err = cudaGetDeviceProperties(&props, 0);
  std::cout << "cudaGetDeviceProperties : " << cudaGetErrorString(err) << "\n";

  std::cout << "total Global memory  : " << props.totalGlobalMem << "\n";
}

And the output for this looks as below.

Set valid devices: no error
Device count: 2
Set device 0: no error
cudaGetDeviceProperties : no error
total Global memory  : 34072559616

Solution

  • Shorter: cudaSetValidDevices() affects the behavior of the CUDA runtime only when no device is explicitly selected. It provides a list of devices for the runtime to try/use when no device has previously been selected in a thread, e.g. by cudaSetDevice(). cudaSetValidDevices() does not, however, prevent or disallow use of any enumerable device if, for example, cudaSetDevice() is used.

    Longer: The CUDA runtime has a per-host-thread notion of the "currently selected" GPU in a multi-GPU machine. This pervades the CUDA multi-GPU programming model: kernel launches will be issued to the currently selected device (there is no device selector in the kernel launch syntax) and similarly for other kinds of work issuance to the device (cudaMemcpy calls, etc.).
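    As a minimal sketch of that model (assuming two usable GPUs; the kernel name is hypothetical), work issued in a host thread targets whichever device is currently selected for that thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to show where launches are issued.
__global__ void whereAmI() { printf("kernel ran on the selected device\n"); }

int main() {
  cudaSetDevice(1);           // select device 1 for this host thread
  whereAmI<<<1, 1>>>();       // launch is issued to device 1 (no device
                              // selector in the launch syntax itself)
  cudaDeviceSynchronize();    // also targets the selected device

  float *p = nullptr;
  cudaMalloc(&p, sizeof(float));  // allocation lands on device 1 as well
  cudaFree(p);
}
```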

    If you have not made any device selection in a particular host thread, by default the selected device will be the one enumerated as zero (assuming it is usable). A device may be enumerable but unusable if, for example, its compute mode is set to Exclusive Process and another process is currently using that device; in that case the CUDA runtime will detect the condition and select another enumerable device. A device will not be enumerated by the CUDA runtime at all if it is excluded, for example via CUDA_VISIBLE_DEVICES. The subsequent discussion assumes that none of these conditions apply and the default device is the device enumerated as 0.

    So, given the above preamble and without any other machinery, a kernel launch will be issued to device 0. You can of course change this behavior with cudaSetDevice(). However, it is also possible to change the default selection process the CUDA runtime uses by supplying, via cudaSetValidDevices(), a set and order of devices to be used. If this call is made, then instead of attempting to use the device enumerated as zero, the runtime will try the devices in the list passed, starting with the first one. The list specifies devices by their enumeration order, so a list of {1,2,3,0} tells the CUDA runtime that, if no device is currently selected in this thread, the first device to try is device 1, not 0. If device 1 is not usable for a reason like those indicated above, the runtime proceeds to device 2, then 3, and finally 0.
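    A sketch of those semantics (assuming a 2-GPU machine with ordinals 0 and 1): the runtime's default choice can be observed with cudaGetDevice() before any explicit selection is made.

```cuda
#include <iostream>
#include <cuda_runtime.h>

int main() {
  int order[] = {1, 0};           // try device 1 first, then device 0
  cudaSetValidDevices(order, 2);

  int dev = -1;
  cudaGetDevice(&dev);            // no cudaSetDevice() has been made,
                                  // so the runtime consults the list
  std::cout << "default device: " << dev << "\n";  // 1, if device 1 is usable
}
```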

    Any explicit selection of a device ordinal via an API call, such as cudaSetDevice() or cudaGetDeviceProperties(), ignores the list set via cudaSetValidDevices().
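    This is why the code in the question sees no error: the explicit cudaSetDevice(0) bypasses the valid-device list entirely. A minimal sketch:

```cuda
#include <iostream>
#include <cuda_runtime.h>

int main() {
  int valid[] = {1};
  cudaSetValidDevices(valid, 1);  // list only device 1

  cudaSetDevice(0);               // explicit selection: the list is ignored
  int dev = -1;
  cudaGetDevice(&dev);
  std::cout << "selected device: " << dev << "\n";  // 0, with no error
}
```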

    cudaGetDevice() always indicates the currently selected device for a given thread, so even in the absence of any previous call to cudaSetDevice(), it will reflect the device selected by the CUDA runtime heuristic described above. Likewise, if there is no previous call to cudaSetDevice() but there is a previous call to cudaSetValidDevices() in that thread, cudaGetDevice() will reflect the choice the CUDA runtime made given that list.