cudajitltonvrtc

How to use the option CU_JIT_LTO with CUDA JIT linking?


I'm wondering if I can improve the link time optimization (LTO) during just-in-time (JIT) linking with the option CU_JIT_LTO. If so, how do I specify this option?

I found the following code in an NVIDIA developer blog, but I don't understand why walltime is given to CU_JIT_LTO. The walltime variable is not defined in the blog. When I tried something similar, it had no effect on my kernel performance.

options[0] = CU_JIT_LTO;
values[0] = (void*)&walltime;
...
cuLinkCreate(..., options, values, &linkState);

source: https://developer.nvidia.com/blog/discovering-new-features-in-cuda-11-4/

My example case uses the input option CU_JIT_INPUT_NVVM to link objects which were created with LTO flags (-dlto or -code=lto_80). It seems the linker already does some LTO because the kernel "performs better" than linking object files without LTO, but not as good as linking with LTO using NVCC. (see example case for detailed results and discussion)

Example case

To check the effectiveness of the link time optimization (LTO), I created a simple program using 4 different methods and used the number of registers per thread as an indicator. This gives the following result on my system (OS: ubuntu 20.04, CUDA toolkit: 11.5.1, NVIDIA driver: 495.44, GPU: NVIDIA RTX 3080).

                       method                                registers/thread
Create program using a single translation file           :         30
Link files using NVCC without link time optimization     :         44
Link files using NVCC with link time optimization        :         30
Link files using NVRTC/JIT with link time optimization   :         38

Result interpretation:

Creating the program from a single translation file should give the best possible result. The compiler can see all the function implementations and use it to optimize the kernel. This results in 30 registers/thread.

Linking with LTO using NVCC definitely works. It uses the same number of registers (30) as the program compiled from a single .cu file, which is not the case without LTO (uses 44 registers).

Linking files using NVRTC/JIT does a "better job" (when we only focus on register usage) than linking with NVCC without LTO, but not as good as linking with NVCC with LTO. The kernel uses 38 registers/thread.

Note: My goal is not to reduce the register usage, I only use it as an indicator. Because the program from a single translation file uses 30 registers/thread, I assume that a fully optimized linked program would have the same "final executable code" and thus use the same amount of registers. Because this is not the case, I started looking at the JIT options.

CU_JIT_LTO option:

I tried to further optimize the linking in the NVRTC/JIT case with the JIT_option CU_JIT_LTO. However, I'm not sure how to use this option. I tried it in the following two ways (see the file cuda code below for more context. The code for linking starts at line 41):

METHOD 1: Add the option CU_JIT_LTO to cuLinkCreate(...). This seems to have no effect. The code uses the same number of registers when int lto = 0 and int lto = 1.

METHOD 2: Add the option CU_JIT_LTO to cuLinkAddFile(...) and cuLinkAddData(...). This immedialely gives the error CUDA_ERROR_INVALID_VALUE.

So now my question: How should the option CU_JIT_LTO be used?

Files:

Below are two files. Follow these steps to run the example (on a linux OS):

  1. Save the cuda code in a file with .cu extension (e.g.: code.cu)
  2. Save the bash script in a file (e.g.: run.sh)
  3. Run this command from a terminal: bash run.sh code.cu

cuda code:

#include <iostream>
#include <stdio.h>

#ifdef RTC
#include <cuda.h>
#include <nvrtc.h>
#define NVRTC_CHECK(x)                                                                            \
  do {                                                                                            \
    nvrtcResult result = x;                                                                       \
    if (result != NVRTC_SUCCESS) {                                                                \
      std::cerr << "\nerror: " #x " failed with error " << nvrtcGetErrorString(result) << '\n';   \
      exit(1);                                                                                    \
    }                                                                                             \
  } while (0)
#define CUDA_CHECK(x)                                                                             \
  do {                                                                                            \
    CUresult result = x;                                                                          \
    if (result != CUDA_SUCCESS) {                                                                 \
      const char* msg;                                                                            \
      cuGetErrorName(result, &msg);                                                               \
      std::cerr << "\nerror: " #x " failed with error " << msg << '\n';                           \
      exit(1);                                                                                    \
    }                                                                                             \
  } while (0)

CUmodule compileModule(std::string program)
{
  // Compile nvvm from program string ===============
  nvrtcProgram prog;
  NVRTC_CHECK(nvrtcCreateProgram(&prog, program.c_str(), "programRTC.cu", 0, NULL, NULL));

  const char* opts[] = {"-arch=compute_80", "-dlto", "-dc"};
  nvrtcResult compileResult = nvrtcCompileProgram(prog, 3, opts);

  // Obtain NVVM from the program.
  size_t nvvmSize;
  NVRTC_CHECK(nvrtcGetNVVMSize(prog, &nvvmSize));
  char* nvvm = new char[nvvmSize];
  NVRTC_CHECK(nvrtcGetNVVM(prog, nvvm));

  // Link files ===============
  CUlinkState linker;

  // ARE THE OPTIONS SPECIFIED CORRECTLY?
  int lto = 1;
  CUjit_option options[] = {CU_JIT_LTO};
  void* values[] = {(void*)&lto};

  // METHOD 1: GIVE THE OPTIONS TO 'cuLinkCreate(...)'
  //           -> HAS NO EFFECT ON THE AMOUNT OF REGISTERS USED
  // -------------------------------------------------------------------------------------------
  // CUDA_CHECK(cuLinkCreate(0, NULL, NULL, &linker));
  CUDA_CHECK(cuLinkCreate(1, options, values, &linker));
  // -------------------------------------------------------------------------------------------


  // METHOD 2: GIVE THE OPTIONS TO 'cuLinkAddFile(...)' and 'cuLinkAddData(...)'
  //           -> FUNCTION FAILS WITH ERROR 'CUDA_ERROR_INVALID_VALUE'
  // -------------------------------------------------------------------------------------------
  CUDA_CHECK(cuLinkAddFile(linker, CU_JIT_INPUT_NVVM, "func_link_nvrtc_lto.o", 0, NULL, NULL));
  CUDA_CHECK(cuLinkAddData(linker, CU_JIT_INPUT_NVVM, (void*)nvvm, nvvmSize, "programRTC.o", 0,
                           NULL, NULL));

  // CUDA_CHECK(cuLinkAddFile(linker, CU_JIT_INPUT_NVVM, "func_link_nvrtc_lto.o", 1, options, values));
  // CUDA_CHECK(cuLinkAddData(linker, CU_JIT_INPUT_NVVM, (void*)nvvm, nvvmSize, "programRTC.o", 1,
  //                          options, values));
  // -------------------------------------------------------------------------------------------

  // Create module ===============
  void* cubin;
  CUmodule module;
  CUDA_CHECK(cuLinkComplete(linker, &cubin, NULL));
  CUDA_CHECK(cuModuleLoadDataEx(&module, cubin, 0, NULL, NULL));

  // Cleanup
  NVRTC_CHECK(nvrtcDestroyProgram(&prog));
  CUDA_CHECK(cuLinkDestroy(linker));

  return module;
}
#endif // RTC

__device__ double func(double a, double b);
#ifdef FUNC
__device__ double func(double a, double b)
{
  return pow(a, b);
}
#endif

#ifdef MAIN
#ifdef RTC
std::string the_program = R"===(
__device__ double func(double a, double b);

extern "C" __global__ void kernel(double* out, double* a, double* b)
{
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if(tid >= 1){
    return;
  }
  a[tid] = 2;
  b[tid] = 3;
  out[tid] = func(a[tid], b[tid]);
  printf("out[%lu] = %f\n", tid, out[tid]);
})===";
#else  // RTC
__global__ void kernel(double* out, double* a, double* b)
{
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= 1) {
    return;
  }
  a[tid] = 2;
  b[tid] = 3;
  out[tid] = func(a[tid], b[tid]);
  printf("out[%lu] = %f\n", tid, out[tid]);
}
#endif // RTC

int main()
{
  double* a;
  double* b;
  double* out;
  cudaMalloc((void**)&a, sizeof(double));
  cudaMalloc((void**)&b, sizeof(double));
  cudaMalloc((void**)&out, sizeof(double));

#ifdef RTC
  // Create context
  CUdevice cuDevice;
  CUcontext context;
  CUDA_CHECK(cuInit(0));
  CUDA_CHECK(cuDeviceGet(&cuDevice, 0));
  CUDA_CHECK(cuCtxCreate(&context, 0, cuDevice));

  CUmodule module = compileModule(the_program);

  CUfunction kernel;
  CUDA_CHECK(cuModuleGetFunction(&kernel, module, "kernel"));

  size_t n_blocks = 1;
  size_t n_threads = 1;
  void* args[] = {&out, &a, &b};
  CUDA_CHECK(cuLaunchKernel(kernel, n_blocks, 1, 1, // grid dim
                            n_threads, 1, 1,        // block dim
                            0, NULL,                // shared mem and stream
                            args, 0));              // arguments
  CUDA_CHECK(cuCtxSynchronize());

  // Cleanup
  CUDA_CHECK(cuModuleUnload(module));
  CUDA_CHECK(cuCtxDestroy(context));
#else  // RTC
  kernel<<<1, 1>>>(out, a, b);
  cudaDeviceSynchronize();
#endif // RTC

  return 0;
}
#endif // MAIN

bash script:

#!/bin/bash

set -e # stop script when an error occurs

SCRIPT=$1
xCCx=80 # CUDA compute compatibility

# Create program using a single translation file
echo -e "\n---------- main_single ----------\n"
nvcc -DFUNC -DMAIN $SCRIPT -o main_single \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_single # should print 'out[0] = 8.0'
cuobjdump main_single -res-usage | grep kernel -A1

# Link files using NVCC without link time optimization (code=compute_...)
echo -e "\n---------- main_link_nvcc ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvcc.o -dc \
    -gencode arch=compute_$xCCx,code=compute_$xCCx
nvcc -DMAIN $SCRIPT -o main_link_nvcc.o -dc \
    -gencode arch=compute_$xCCx,code=compute_$xCCx
nvcc func_link_nvcc.o main_link_nvcc.o -o main_link_nvcc \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvcc # should print 'out[0] = 8.0'
cuobjdump main_link_nvcc -res-usage | grep kernel -A1

# Link files using NVCC with link time optimization (code=lto_...)
echo -e "\n---------- main_link_nvcc_lto ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvcc_lto.o -dc \
    -gencode arch=compute_$xCCx,code=lto_$xCCx
nvcc -DMAIN $SCRIPT -o main_link_nvcc_lto.o -dc \
    -gencode arch=compute_$xCCx,code=lto_$xCCx
nvcc func_link_nvcc_lto.o main_link_nvcc_lto.o -o main_link_nvcc_lto -dlto \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvcc_lto # should print 'out[0] = 8.0'
cuobjdump main_link_nvcc_lto -res-usage | grep kernel -A1

# Link files using NVRTC with link time optimization
echo -e "\n---------- main_link_nvrtc_lto ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvrtc_lto.o -dc -ptx \
    -gencode arch=compute_$xCCx,code=lto_$xCCx
nvcc -DMAIN -DRTC $SCRIPT -o main_link_nvrtc_lto \
    -lnvrtc_static -lnvrtc-builtins_static -lnvptxcompiler_static -lcuda -lpthread \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvrtc_lto # should print 'out[0] = 8.0'
ncu main_link_nvrtc_lto | grep register/thread


# Registers/thread used on my system with an NVIDIA RTX 3080:
# main_single          : 30 registers/thread
# main_link_nvcc       : 44 registers/thread
# main_link_nvcc_lto   : 30 registers/thread
# main_link_nvrtc_lto  : 38 registers/thread

Subquestion: generate an NVVM IR file

To procude the file func_link_nvrtc_lto.o which works with the command cuLinkAddFile(linker, CU_JIT_INPUT_NVVM, "func_link_nvrtc_lto.o", ...), I had to add the -ptx flag as shown in the command below. I did not find this in any documentation but by try-and-error. I was wondering if there is a better way to produce such a file.

nvcc -DFUNC $SCRIPT -o func_link_nvrtc_lto.o -dc -ptx \
-gencode arch=compute_$xCCx,code=lto_$xCCx

Update CUDA 12.x

With the release of CUDA 12.0 which introduces the nvJitLink library, this question has become irrelevant. Because the cuLinkCreate which could accept the option CU_JIT_LTO is now deprecated.

For those interested, below are the modified scripts and results with CUDA 12.0.

                       method                                registers/thread
Create program using a single translation file                :        30
Link files using NVCC without link time optimization          :        44
Link files using NVCC with link time optimization             :        30
Link files using NVRTC/nvJitLink with link time optimization  :        32

cuda code:

#include <iostream>
#include <stdio.h>

#ifdef RTC
#include <cuda.h>
#include <nvrtc.h>
#include <nvJitLink.h>
#define NVRTC_CHECK(x)                                                                            \
  do {                                                                                            \
    nvrtcResult result = x;                                                                       \
    if (result != NVRTC_SUCCESS) {                                                                \
      std::cerr << "\nerror: " #x " failed with error " << nvrtcGetErrorString(result) << '\n';   \
      exit(1);                                                                                    \
    }                                                                                             \
  } while (0)
#define CUDA_CHECK(x)                                                                             \
  do {                                                                                            \
    CUresult result = x;                                                                          \
    if (result != CUDA_SUCCESS) {                                                                 \
      const char* msg;                                                                            \
      cuGetErrorName(result, &msg);                                                               \
      std::cerr << "\nerror: " #x " failed with error " << msg << '\n';                           \
      exit(1);                                                                                    \
    }                                                                                             \
  } while (0)

#define NVJITLINK_CHECK(h, x)                                                                     \
  do {                                                                                            \
    nvJitLinkResult result = x;                                                                   \
    if (result != NVJITLINK_SUCCESS) {                                                            \
      std::cerr << "\nerror: " #x " failed with error " << result << '\n';                        \
      size_t lsize;                                                                               \
      result = nvJitLinkGetErrorLogSize(h, &lsize);                                               \
      if (result == NVJITLINK_SUCCESS && lsize > 0) {                                             \
        char* log = (char*)malloc(lsize);                                                         \
        result = nvJitLinkGetErrorLog(h, log);                                                    \
        if (result == NVJITLINK_SUCCESS) {                                                        \
          std::cerr << "error: " << log << '\n';                                                  \
          free(log);                                                                              \
        }                                                                                         \
      }                                                                                           \
      exit(1);                                                                                    \
    }                                                                                             \
  } while (0)

CUmodule compileModule(std::string program)
{
  // Compile lotir from program string ===============
  nvrtcProgram prog;
  NVRTC_CHECK(nvrtcCreateProgram(&prog, program.c_str(), "programRTC.cu", 0, NULL, NULL));

  const char* opts[] = {"-arch=compute_80", "-dlto", "-dc"};
  nvrtcResult compileResult = nvrtcCompileProgram(prog, 3, opts);

  // Obtain LTO-IR from the program.
  size_t lotirSize;
  NVRTC_CHECK(nvrtcGetLTOIRSize(prog, &lotirSize));
  char* lotir = new char[lotirSize];
  NVRTC_CHECK(nvrtcGetLTOIR(prog, lotir));

  // Link files ===============
  nvJitLinkHandle linker;
  const char* link_options[] = {"-arch=sm_80", "-lto"};
  nvJitLinkCreate(&linker, 2, link_options);

  NVJITLINK_CHECK(linker,
                  nvJitLinkAddFile(linker, NVJITLINK_INPUT_FATBIN, "func_link_nvrtc_lto.o"));

  NVJITLINK_CHECK(linker, nvJitLinkAddData(linker, NVJITLINK_INPUT_LTOIR, (void*)lotir, lotirSize,
                                           "programRTC.o"));

  NVJITLINK_CHECK(linker, nvJitLinkComplete(linker));

  // Create module ===============
  size_t cubin_size;
  NVJITLINK_CHECK(linker, nvJitLinkGetLinkedCubinSize(linker, &cubin_size));
  void* cubin = malloc(cubin_size);
  NVJITLINK_CHECK(linker, nvJitLinkGetLinkedCubin(linker, cubin));

  CUmodule module;
  CUDA_CHECK(cuModuleLoadData(&module, cubin));

  // Cleanup
  NVRTC_CHECK(nvrtcDestroyProgram(&prog));
  NVJITLINK_CHECK(linker, nvJitLinkDestroy(&linker));

  return module;
}
#endif // RTC

__device__ double func(double a, double b);
#ifdef FUNC
__device__ double func(double a, double b)
{
  return pow(a, b);
}
#endif

#ifdef MAIN
#ifdef RTC
std::string the_program = R"===(
__device__ double func(double a, double b);

extern "C" __global__ void kernel(double* out, double* a, double* b)
{
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if(tid >= 1){
    return;
  }
  a[tid] = 2;
  b[tid] = 3;
  out[tid] = func(a[tid], b[tid]);
  printf("out[%lu] = %f\n", tid, out[tid]);
})===";
#else  // RTC
__global__ void kernel(double* out, double* a, double* b)
{
  size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= 1) {
    return;
  }
  a[tid] = 2;
  b[tid] = 3;
  out[tid] = func(a[tid], b[tid]);
  printf("out[%lu] = %f\n", tid, out[tid]);
}
#endif // RTC

int main()
{
  double* a;
  double* b;
  double* out;
  cudaMalloc((void**)&a, sizeof(double));
  cudaMalloc((void**)&b, sizeof(double));
  cudaMalloc((void**)&out, sizeof(double));

#ifdef RTC
  // Create context
  CUdevice cuDevice;
  CUcontext context;
  CUDA_CHECK(cuInit(0));
  CUDA_CHECK(cuDeviceGet(&cuDevice, 0));
  CUDA_CHECK(cuCtxCreate(&context, 0, cuDevice));

  CUmodule module = compileModule(the_program);

  CUfunction kernel;
  CUDA_CHECK(cuModuleGetFunction(&kernel, module, "kernel"));

  size_t n_blocks = 1;
  size_t n_threads = 1;
  void* args[] = {&out, &a, &b};
  CUDA_CHECK(cuLaunchKernel(kernel, n_blocks, 1, 1, // grid dim
                            n_threads, 1, 1,        // block dim
                            0, NULL,                // shared mem and stream
                            args, 0));              // arguments
  CUDA_CHECK(cuCtxSynchronize());

  // Cleanup
  CUDA_CHECK(cuModuleUnload(module));
  CUDA_CHECK(cuCtxDestroy(context));
#else  // RTC
  kernel<<<1, 1>>>(out, a, b);
  cudaDeviceSynchronize();
#endif // RTC

  return 0;
}
#endif // MAIN

bash script:

#!/bin/bash

set -e # stop script when an error occurs

SCRIPT=$1
xCCx=80 # CUDA compute compatibility

# Create program using a single translation file
echo -e "\n---------- main_single ----------\n"
nvcc -DFUNC -DMAIN $SCRIPT -o main_single \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_single # should print 'out[0] = 8.0'
cuobjdump main_single -res-usage | grep kernel -A1

# Link files using NVCC without link time optimization (code=compute_...)
echo -e "\n---------- main_link_nvcc ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvcc.o -dc \
    -gencode arch=compute_$xCCx,code=compute_$xCCx
nvcc -DMAIN $SCRIPT -o main_link_nvcc.o -dc \
    -gencode arch=compute_$xCCx,code=compute_$xCCx
nvcc func_link_nvcc.o main_link_nvcc.o -o main_link_nvcc \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvcc # should print 'out[0] = 8.0'
cuobjdump main_link_nvcc -res-usage | grep kernel -A1

# Link files using NVCC with link time optimization (code=lto_...)
echo -e "\n---------- main_link_nvcc_lto ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvcc_lto.o -dc \
    -gencode arch=compute_$xCCx,code=lto_$xCCx
nvcc -DMAIN $SCRIPT -o main_link_nvcc_lto.o -dc \
    -gencode arch=compute_$xCCx,code=lto_$xCCx
nvcc func_link_nvcc_lto.o main_link_nvcc_lto.o -o main_link_nvcc_lto -dlto \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvcc_lto # should print 'out[0] = 8.0'
cuobjdump main_link_nvcc_lto -res-usage | grep kernel -A1

# Link files using NVRTC and nvJitLink with link time optimization
echo -e "\n---------- main_link_nvrtc_lto ----------\n"
nvcc -DFUNC $SCRIPT -o func_link_nvrtc_lto.o -dc \
    -gencode arch=compute_$xCCx,code=lto_$xCCx -fatbin
nvcc -DMAIN -DRTC $SCRIPT -o main_link_nvrtc_lto \
    -lnvrtc_static -lnvrtc-builtins_static -lnvJitLink_static -lnvptxcompiler_static -lcuda -lpthread \
    -gencode arch=compute_$xCCx,code=sm_$xCCx
./main_link_nvrtc_lto # should print 'out[0] = 8.0'
ncu main_link_nvrtc_lto | grep register/thread


# Registers/thread used on my system with an NVIDIA RTX 3080:
# main_single          : 30 registers/thread
# main_link_nvcc       : 44 registers/thread
# main_link_nvcc_lto   : 30 registers/thread
# main_link_nvrtc_lto  : 32 registers/thread

Solution

  • First of all, there is unfortunately an error in the blog post with the CU_JIT_LTO value. It should instead be:

    values[0] = (void*)1;
    

    However, it doesn't really matter, as the value is ignored - it is just the presence of CU_JIT_LTO that is used. The CU_JIT_LTO should indeed be passed to cuLinkCreate as you discovered.

    For your sub-question, what your -ptx is doing is stopping the compilation after generating nvvm-ir, but that is an undocumented side-effect. The simpler and safer thing would be to just use:

    nvcc -dc -arch=compute_XX,code=lto_XX
    

    which creates a host object containing the nvvm-ir. Then pass that as:

    CU_JIT_INPUT_OBJECT to cuLinkAddFile().