Tags: cuda, compute-capability

CUDA -arch for older GPUs when compiling only host code that calls CUDA APIs or third-party libs


Assume the installed CUDA version only supports my old GPU when -arch=sm_35 is passed; otherwise, kernels do not execute.

Suppose I now call only CUDA Runtime APIs (cudaMalloc, cudaFree, etc.) in my C++ code and compile with nvcc without passing -arch. Why do these operations still proceed correctly?

It also appears that calls into CUDA-enabled libraries (CUDA-aware MPI, for example) work correctly as long as the libraries themselves were built with the correct arch; my code that links to them does not need -arch.

Is this a correct observation? If so, does this apply to CUDA libraries such as cuBLAS and cuSPARSE? In other words, can I pass -lcublas or -lcusparse with code that only uses those libraries, without passing -arch?


Solution

  • When you specify an architecture to nvcc, you are instructing nvcc what to do when compiling device code. Device code is code that you wrote, or included in source form, that is marked with the __global__ or __device__ qualifier.
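
    As a minimal sketch (filenames and numbers are made up for illustration), the distinction looks like this: only the kernel below is device code, and only it is affected by -arch.

    ```cuda
    // scale.cu -- hypothetical example separating device code from host code.
    // Only the kernel is compiled for the GPU architecture given by -arch,
    // e.g.:  nvcc -arch=sm_35 scale.cu
    #include <cstdio>

    __global__ void scale(float *x, float s, int n)  // device code: governed by -arch
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    int main()                                       // host code: -arch is irrelevant here
    {
        float *d_x;
        cudaMalloc(&d_x, 256 * sizeof(float));       // runtime API call, not device code
        scale<<<1, 256>>>(d_x, 2.0f, 256);           // launches the device code above
        cudaDeviceSynchronize();
        cudaFree(d_x);
        printf("done\n");
        return 0;
    }
    ```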

    The CUDA runtime API (and driver API) do not fit this description. Their behavior doesn't depend on your arch specification, because that specification only governs the compilation of device code. Runtime and driver API calls are calls into pre-compiled libraries, so nvcc does no device code compilation for them.

    If you write CUDA code that depends only on these runtime or driver API libraries (such as deviceQuery), it will not depend on any arch specification. For the same reason, such applications can actually be built without using nvcc at all.
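
    For example, a program like the following contains no device code at all; it is a sketch of a runtime-API-only "CUDA code" that could in principle be built with an ordinary host compiler (include and library paths will vary by installation):

    ```cuda
    // query.cpp -- runtime API only, no __global__ or __device__ code.
    // No -arch is needed; e.g.:  g++ query.cpp -lcudart
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int n = 0;
        cudaError_t err = cudaGetDeviceCount(&n);   // call into pre-compiled libcudart
        if (err != cudaSuccess) {
            printf("error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("%d CUDA device(s) found\n", n);
        return 0;
    }
    ```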

    For other libraries, I would draw a distinction between fully pre-compiled libraries and "template" libraries. (That distinction is not unique or specific to CUDA.) A template library has source code that gets included in your compilation process. Examples of template libraries in CUDA include Thrust and CUTLASS. When using either of these, you generally need to compile your code with nvcc, and the arch specification matters.
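
    A sketch of the template-library case, using Thrust: the reduction's device code is instantiated from headers during your compilation, so the file must be built with nvcc and an arch matching your GPU (the arch value below is just an example).

    ```cuda
    // sum.cu -- template-library usage; build with e.g.:  nvcc -arch=sm_35 sum.cu
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main()
    {
        thrust::device_vector<int> v(100, 1);        // device storage via header code
        int s = thrust::reduce(v.begin(), v.end());  // kernel instantiated at compile time
        printf("sum = %d\n", s);                     // expect 100 on a working GPU
        return 0;
    }
    ```

    Note there is no -lthrust at link time: the device code came from the headers, which is exactly why -arch applies to it.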

    Examples of fully pre-compiled libraries in CUDA are cuFFT, cuBLAS, and cuSOLVER. Device code in those libraries has already been compiled, and you do not get to recompile it, so your arch specification is meaningless to it. Note that some of these libraries have derivatives (such as cuFFTDx) which are template libraries.

    I won't be able to provide an exhaustive list of all CUDA libraries, and it would become out-of-date anyway. You can refer to the sample codes for various libraries to see how they are built. With a bit of experience, you will learn to differentiate template from pre-compiled libraries yourself. When compiling Thrust code, for example, you include Thrust headers but there is no Thrust library to link against at link time; this is a strong indication you are dealing with a template library.
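
    A sketch of the pre-compiled case, using cuBLAS: every kernel involved lives inside libcublas, already built for the GPUs the library supports, so this file is pure host code and needs no -arch (paths and the exact link line will vary by installation).

    ```cuda
    // saxpy.cpp -- pre-compiled library usage; e.g.:  g++ saxpy.cpp -lcublas -lcudart
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const int n = 4;
        float hx[n] = {1, 2, 3, 4}, hy[n] = {0, 0, 0, 0};
        float *dx, *dy, a = 2.0f;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));

        cublasHandle_t h;
        cublasCreate(&h);
        cublasSetVector(n, sizeof(float), hx, 1, dx, 1);
        cublasSetVector(n, sizeof(float), hy, 1, dy, 1);
        cublasSaxpy(h, n, &a, dx, 1, dy, 1);    // y = a*x + y, runs the library's kernels
        cublasGetVector(n, sizeof(float), dy, 1, hy, 1);
        printf("%g %g %g %g\n", hy[0], hy[1], hy[2], hy[3]);

        cublasDestroy(h);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }
    ```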

    Whether or not you are in doubt about any of this, when using nvcc the usual suggestion is to compile with an arch specification that matches the device(s) you intend to run on. That works whether the specification is actually needed or not.