[SOLVED] Does the CUDA JIT compiler perform device link-time optimization?

Does the CUDA JIT compiler perform device link-time optimization?

Before device link-time optimization (DLTO) was introduced in CUDA 11.2, it was relatively easy to ensure forward compatibility without worrying too much about differences in performance. You would typically just create a fatbinary containing PTX for the lowest possible arch and SASS for the specific architectures you would normally target. For any future GPU architectures, the JIT compiler would then assemble the PTX into SASS optimized for that specific GPU arch.

Now, however, with DLTO, it is less clear to me how to ensure forward compatibility and maintain performance on those future architectures.

Let’s say I compile/link an application using nvcc with the following options:

Compile

-gencode=arch=compute_52,code=[compute_52,lto_52]
-gencode=arch=compute_61,code=lto_61

Link

-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto

This will create a fatbinary containing PTX for cc_52, LTO intermediaries for sm_52 and sm_61, and link-time optimized SASS for sm_52 and sm_61 (or at least this appears to be the case when dumping the resulting fatbin sections using cuobjdump -all anyway).

Assuming the above is correct, what happens when the application is run on a later GPU architecture (e.g. sm_70)? Does the JIT compiler just assemble the cc_52 PTX without using link-time optimization (resulting in less optimal code)? Or does it somehow link the LTO intermediaries using link-time optimization? Is there a way to determine/guide what the JIT compiler is doing?

Solution

According to an NVIDIA employee on the CUDA forums the answer is "not yet":

Good question. We are working on support for JIT LTO, but in 11.2 it is not supported. So in the example you give at JIT time it will JIT each individual PTX to cubin and then do a cubin link. This is the same as we have always done for JIT linking. But we should have more support for JIT LTO in future releases.