If I understand the workflow description in the NVRTC documentation correctly, here's how it works:
you compile the program to PTX with NVRTC, then link that PTX using the CUDA driver API (cuLinkCreate, cuLinkAddData, cuLinkComplete) to get the cubin.

However... beginning with CUDA 11.3, NVRTC has the following API call:
nvrtcResult nvrtcGetCUBIN ( nvrtcProgram prog, char* cubin );
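Since that signature takes a caller-allocated buffer, you would normally ask for the size first (nvrtcGetCUBINSize) and then fill the buffer. Here is a minimal sketch of that two-call pattern. The helper name fetch_blob is my own invention, not part of NVRTC, and it takes the two calls as callables so the sketch builds without the CUDA toolkit. If I recall the documentation correctly, the reported size is 0 when the architecture option names a virtual architecture, which the helper treats as "no cubin".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an NVRTC function): runs the usual
// "query size, then fill a caller-owned buffer" sequence.
// Both callables are assumed to return 0 on success, which matches
// NVRTC_SUCCESS == 0 for nvrtcResult.
template <typename SizeFn, typename FillFn>
bool fetch_blob(SizeFn query_size, FillFn fill, std::string& out) {
    std::size_t n = 0;
    if (query_size(&n) != 0) return false;  // size query failed
    out.assign(n, '\0');                    // allocate n bytes
    if (n == 0) return true;                // empty result, nothing to copy
    return fill(&out[0]) == 0;              // write the bytes into the buffer
}

// Intended use with NVRTC, after nvrtcCompileProgram(prog, ...) succeeded:
//   std::string cubin;
//   bool ok = fetch_blob(
//       [&](std::size_t* n) { return nvrtcGetCUBINSize(prog, n); },
//       [&](char* p)        { return nvrtcGetCUBIN(prog, p); },
//       cubin);
//   // ok && cubin.empty() would mean the compile produced no cubin.
```

The same helper works for nvrtcGetPTXSize / nvrtcGetPTX, since they follow the same size-then-data convention.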
So how can I get a cubin from compilation alone, without linking?
Well, on the host side you get proper machine code after just compilation, so why not on the device side?
It seems that cubin availability depends on what you targeted with your compilation:
If you targeted a "virtual architecture", i.e. a certain compute capability (e.g. compute_60), then the only thing you can get is the PTX, which is not yet specific to any microarchitecture.
If you targeted a concrete (micro-)architecture (e.g. sm_70), then compilation can proceed all the way to SASS assembly placed in a cubin.
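To make the two cases concrete, here is a small sketch of how host code might decide, from the architecture string alone, whether to expect a cubin. The function names cubin_expected and arch_flag are hypothetical. The only NVRTC detail I am relying on is the --gpu-architecture=<arch> compile option.

```cpp
#include <string>

// Hypothetical helper: true when the target names a concrete
// micro-architecture ("sm_XY"), i.e. when compilation can go all the
// way to SASS. Anything else (e.g. "compute_XY") is treated as PTX-only.
bool cubin_expected(const std::string& arch) {
    return arch.rfind("sm_", 0) == 0;  // does arch start with "sm_"?
}

// Hypothetical helper: builds the option string passed to the compiler.
std::string arch_flag(const std::string& arch) {
    return "--gpu-architecture=" + arch;
}

// Intended use: pass arch_flag(arch).c_str() in the options array of
// nvrtcCompileProgram, then call nvrtcGetCUBIN only if
// cubin_expected(arch) is true, and nvrtcGetPTX otherwise.
```

This is only a string check. It does not verify that the architecture is one your NVRTC version actually supports.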
Now, when you link using the CUDA driver, you have a context at play, and that's always associated with a physical GPU - a concrete micro-architecture. So that necessarily gives you a cubin.
PS:
--dlink-time-opt.
nvrtcGetCUBIN() at all. This seems to have also affected the creation of modules, i.e. whether you can create a module from the PTX or from the cubin.