cuda  jit  nvcc  cuda-driver  cuda-jit-cache

Is NVIDIA's JIT compilation cache used when you don't use NVCC?


As we should all know (but not enough people do), when you build a CUDA program with NVCC and run it on a device for which the binary contains no fully compiled (SASS) code, the intermediate PTX code is JIT-compiled, and the result is what actually runs your kernels. During this JIT step, a compilation cache kicks in, so that the next time you run the same executable, compilation can be skipped in favor of just loading the cached result.

Now, suppose I'm writing a C++ program which compiles a kernel dynamically, at run time, rather than using NVCC, e.g.:

  1. I use NVRTC's nvrtcCompileProgram() to compile CUDA C++ code, targeting a concrete architecture (e.g. sm_70).
  2. I use the CUDA driver's cuModuleLoad() to load a PTX file with the kernel.

In either case, will the compilation result be placed in that cache?


Solution

  • The caching behaviour you are describing has nothing to do with either nvcc or nvrtc. Caching of JIT-compiled code is a driver-level mechanism, provided primarily so that older code remains compatible with newer hardware.

    There are exactly three cases to consider when running a kernel through either the runtime or the driver API:

    1. The application provides compatible SASS to the driver (be that a statically linked payload in a runtime API application, or SASS loaded from a file, or SASS emitted by using nvrtc with a physical architecture as a target). In this case the SASS is loaded and executed. No caching is involved.

    2. The application provides valid PTX code (be that from a fatbinary payload in the case where there is no compatible SASS present, or loaded via the driver API, whatever the source of that payload is, which includes nvrtc in the case where a virtual architecture is used as a target). In this case the driver triggers JIT compilation of the PTX and loads the resulting SASS to execute. This is where caching occurs. The driver checks the user-specific private cache of JIT output, if it exists, and if it finds a match for PTX it has previously compiled, it retrieves the SASS from the cache and uses it rather than compiling the same PTX again. This mechanism can be defeated by setting CUDA_CACHE_DISABLE to 1. A fuller discussion of this mechanism and its controls can be found here. If the PTX is invalid, an invalid (or incompatible) PTX error will be returned to the caller and execution fails.

    3. The application provides neither compatible SASS nor PTX. In this case a "no binary for GPU" error (or its runtime API equivalent) will be returned to the caller and execution fails. The driver PTX cache plays no role in this case.
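The three cases can be summarised as a small decision function. This is only a sketch of the reasoning above in plain C++, not driver code: `Payload`, `LoadPath`, `classify` and `consults_jit_cache` are hypothetical names introduced for illustration, and the model assumes the driver prefers compatible SASS whenever it is present.

```cpp
// Hypothetical model of what an application hands to the driver.
// These types are illustrative only and are not part of any CUDA API.
struct Payload {
    bool has_compatible_sass;  // SASS matching the running device is present
    bool has_ptx;              // PTX is present
};

enum class LoadPath {
    LoadSass,       // case 1: SASS is loaded directly
    JitFromPtx,     // case 2: PTX is JIT-compiled (or fetched from the cache)
    NoBinaryError   // case 3: nothing usable, loading fails
};

// Assumption: compatible SASS takes priority over PTX when both are present.
LoadPath classify(const Payload& p) {
    if (p.has_compatible_sass) return LoadPath::LoadSass;
    if (p.has_ptx)             return LoadPath::JitFromPtx;
    return LoadPath::NoBinaryError;
}

// Only the PTX path involves the driver's JIT cache.
bool consults_jit_cache(LoadPath path) {
    return path == LoadPath::JitFromPtx;
}
```

For example, `classify(Payload{false, true})` yields `LoadPath::JitFromPtx`, the only path on which `consults_jit_cache` returns true.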

    So to your two scenarios:

    I use NVRTC's nvrtcCompileProgram() to compile CUDA C++ code, targeting a concrete architecture (e.g. sm_70).

    In this scenario, you fall into the first or third cases above. The binary payload will be loaded and executed if valid, or fail with an error if invalid. No caching occurs.
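Whether an NVRTC compilation ends up in case 1 or case 2 therefore hinges on the kind of architecture named in the compile options. The helper below is a minimal sketch of that distinction, assuming the usual naming convention in which virtual architectures are spelled `compute_XX` and physical ones `sm_XX`. `ArchKind`, `arch_kind` and `load_may_use_jit_cache` are hypothetical names, and the sketch assumes the application loads the SASS output, not the PTX, when it targets a physical architecture.

```cpp
#include <string>

enum class ArchKind { Virtual, Physical, Unknown };

// Classifies an architecture name such as "compute_70" or "sm_70"
// purely by its prefix. No validation of the numeric part is attempted.
ArchKind arch_kind(const std::string& arch) {
    if (arch.rfind("compute_", 0) == 0) return ArchKind::Virtual;   // PTX target
    if (arch.rfind("sm_", 0) == 0)      return ArchKind::Physical;  // SASS target
    return ArchKind::Unknown;
}

// Following the answer's reasoning: only PTX (a virtual target) is
// JIT-compiled by the driver, so only then can the JIT cache be involved.
bool load_may_use_jit_cache(const std::string& arch) {
    return arch_kind(arch) == ArchKind::Virtual;
}
```

With this model, `load_may_use_jit_cache("sm_70")` is false, matching the scenario in the question, while `load_may_use_jit_cache("compute_70")` is true.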

    I use the CUDA driver's cuModuleLoad() to load a PTX file with the kernel.

    In this scenario, case 2 applies. The driver checks the cache and either reuses the output of a previous JIT pass or, on a cache miss, JIT-compiles the PTX and caches the result. If the PTX is valid and compatible, the kernel runs.
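The hit-or-miss behaviour described here can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not how the driver is implemented: `JitCacheModel` and its members are hypothetical, the cache is keyed on the PTX text alone (the real key and on-disk format are internal to the driver), and "compilation" is faked with string concatenation. The `disabled` flag mimics the effect of setting CUDA_CACHE_DISABLE to 1.

```cpp
#include <string>
#include <unordered_map>

// Toy stand-in for the driver's JIT cache. Illustrative only.
struct JitCacheModel {
    bool disabled = false;  // mimics CUDA_CACHE_DISABLE=1
    std::unordered_map<std::string, std::string> entries;  // PTX -> fake SASS
    int compile_count = 0;  // number of simulated JIT compilations

    // Returns the fake SASS for the given PTX, compiling only on a miss.
    std::string load_ptx(const std::string& ptx) {
        if (!disabled) {
            auto it = entries.find(ptx);
            if (it != entries.end()) {
                return it->second;  // cache hit: skip compilation
            }
        }
        ++compile_count;  // cache miss (or cache disabled): "JIT compile"
        std::string sass = "sass(" + ptx + ")";
        if (!disabled) {
            entries.emplace(ptx, sass);  // store the result for the next load
        }
        return sass;
    }
};
```

Loading the same PTX twice through one `JitCacheModel` increments `compile_count` only once, whereas with `disabled` set every load compiles again and nothing is stored.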