When writing PTX in a separate file, a kernel parameter can be loaded into a register with:
.reg .u32 test;
ld.param.u32 test, [test_param];
However, when using inline PTX, the application note "Using Inline PTX Assembly in CUDA" (version 01) describes a syntax in which loading a parameter is tied to the operation that uses it. It provides this example:
asm("add.s32 %0, %1, %2;" : "=r"(i) : "r"(j), "r"(k));
Which generates:
ld.s32 r1, [j];
ld.s32 r2, [k];
add.s32 r3, r1, r2;
st.s32 [i], r3;
In many cases the load and the operation need to be separate. For instance, one might want to load the parameter into a register before a loop and then reuse and modify that register inside the loop. The only way I have found to do this is an extra mov instruction that copies the parameter from the register it was implicitly loaded into to another register I can use later.
Is there a way to avoid this additional mov instruction when moving from PTX in a separate file to inline PTX?
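For concreteness, here is a minimal sketch of the workaround described above. The kernel name accumulate and the variables start, acc and n are made up for illustration; the "+r" constraint marks an operand as both read and written.

```cuda
#include <cstdio>

// Hypothetical kernel: 'start' is the parameter I want to keep in a register.
__global__ void accumulate(int *out, int start, int n)
{
    int acc;
    // The extra mov: copy the implicitly loaded parameter into a register
    // that later asm statements can reuse and modify.
    asm("mov.s32 %0, %1;" : "=r"(acc) : "r"(start));

    for (int i = 0; i < n; ++i) {
        // "+r" tells the compiler that acc is both an input and an output.
        asm("add.s32 %0, %0, %1;" : "+r"(acc) : "r"(i));
    }
    *out = acc;
}

int main()
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    accumulate<<<1, 1>>>(d_out, 10, 4);   // computes 10 + 0 + 1 + 2 + 3
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("%d\n", h_out);
    return 0;
}
```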
If I were you, I wouldn't worry too much about those mov operations.
Keep in mind that PTX is not the final assembly code.
PTX is further compiled into machine code (a CUBIN) before the kernel launches. Among other things, this last step performs register allocation and will typically remove unnecessary mov operations.
In particular, if you move from %r1 to %r2 and then never use %r1 again, the register allocator is likely to assign %r1 and %r2 to the same hardware register and remove the move.
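If you want to check this rather than take it on trust, something like the following sketch may help. It is only an illustration: the kernel names with_mov and without_mov are invented here, and whether the mov actually disappears depends on your toolkit version, target architecture and optimization level. The two kernels should produce the same result, and you can compare their machine code yourself.

```cuda
#include <cstdio>

// Hypothetical variant A: explicit mov into a second register via inline PTX.
__global__ void with_mov(int *out, int start, int n)
{
    int acc;
    asm("mov.s32 %0, %1;" : "=r"(acc) : "r"(start));
    for (int i = 0; i < n; ++i)
        asm("add.s32 %0, %0, %1;" : "+r"(acc) : "r"(i));
    out[0] = acc;
}

// Hypothetical variant B: plain C++ copy, no hand-written mov.
__global__ void without_mov(int *out, int start, int n)
{
    int acc = start;
    for (int i = 0; i < n; ++i)
        asm("add.s32 %0, %0, %1;" : "+r"(acc) : "r"(i));
    out[1] = acc;
}

int main()
{
    int *d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_out, 2 * sizeof(int));
    with_mov<<<1, 1>>>(d_out, 10, 4);
    without_mov<<<1, 1>>>(d_out, 10, 4);
    cudaMemcpy(h_out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    // Both variants should agree on the result.
    printf("%s\n", h_out[0] == h_out[1] ? "match" : "mismatch");
    return 0;
}
```

To look at what the compiler produced, nvcc -ptx emits the intermediate PTX, and running cuobjdump -sass on the compiled binary shows the final machine code. The PTX may still contain the mov; the interesting question is whether it survives in the machine code.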