I have a simple kernel:
__global__ void hello_cuda() {
int a = 10;
printf("hello from GPU\n");
}
When I use Nsight Compute to view the Source and SASS page, I see:
# Address Source
1 00007fe8 82f9ca00 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
2 00007fe8 82f9ca10 UMOV UR4, 0x0
3 00007fe8 82f9ca20 CS2R R6, SRZ
4 00007fe8 82f9ca30 UMOV UR5, 0x0
5 00007fe8 82f9ca40 MOV R4, UR4
6 00007fe8 82f9ca50 IMAD.U32 R5, RZ, RZ, UR5
7 00007fe8 82f9ca60 MOV R20, 0x0
8 00007fe8 82f9ca70 MOV R21, 0x0
9 00007fe8 82f9ca80 CALL.ABS.NOINC 0x0
10 00007fe8 82f9ca90 EXIT
11 00007fe8 82f9caa0 BRA 0x7fe882f9caa0
12 00007fe8 82f9cab0 NOP
13 00007fe8 82f9cac0 NOP
14 00007fe8 82f9cad0 NOP
15 00007fe8 82f9cae0 NOP
16 00007fe8 82f9caf0 NOP
But when I run cuobjdump -sass saas_source_mapping_trial, I get:
code for sm_52
Function : _Z10hello_cudav
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001fc400fe2007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ; /* 0x4c98078000870001 */
/*0010*/ MOV32I R4, 0x0 ; /* 0x010000000007f004 */
/*0018*/ MOV32I R5, 0x0 ; /* 0x010000000007f005 */
/* 0x001ff400fec007f1 */
/*0028*/ MOV R6, RZ ; /* 0x5c9807800ff70006 */
/*0030*/ MOV R7, RZ ; /* 0x5c9807800ff70007 */
/*0038*/ JCAL 0x0 ; /* 0xe220000000000040 */
/* 0x001ffc00fda007ef */
/*0048*/ NOP ; /* 0x50b0000000070f00 */
/*0050*/ NOP ; /* 0x50b0000000070f00 */
/*0058*/ EXIT ; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0068*/ BRA 0x60 ; /* 0xe2400fffff07000f */
/*0070*/ NOP ; /* 0x50b0000000070f00 */
/*0078*/ NOP ; /* 0x50b0000000070f00 */
Why is there a difference between my SASS and the NCU one?
Unless you are running on a cc5.2 device, the Nsight Compute SASS won't match the SASS obtained with cuobjdump. cuobjdump shows you the compiled code stored in the binary. But if you have compiled with PTX included, and are running on a different architecture (other than cc5.2 in this case), then the JIT mechanism will recompile your device code on the fly, and Nsight Compute will "see" that recompiled device code. So the first step towards harmonization/sanity in this situation is to compile for the device architecture you are actually running on, and make sure you are comparing SASS that is intended for the same architecture.
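As a concrete sketch (sm_86 and the file names here are placeholders, not taken from your setup; substitute your GPU's actual compute capability and your own source file), compiling for the architecture you run on looks like:

```shell
# Embed machine code for the architecture you will actually run on
# (sm_86 is just an example target), so no JIT recompilation happens:
nvcc -arch=sm_86 -o my_trial my_trial.cu

# Now the statically dumped SASS should match what the device executes
# and what Nsight Compute shows:
cuobjdump -sass my_trial
```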
(I'm fairly confident you are not comparing apples to apples here. The cuobjdump SASS indicates it was compiled for sm_52, a Maxwell architecture, whereas the ncu SASS includes a UMOV instruction, which targets the uniform datapath on a Turing or newer GPU.)
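One way to check which architecture you are actually running on (the compute_cap query field is an assumption that your nvidia-smi is recent enough to support it; deviceQuery from the CUDA samples is an alternative):

```shell
# Print the compute capability of each visible GPU, e.g. "8.6" for sm_86:
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```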
How did NCU get those addresses? The addresses from my cuobjdump are not the same at all.
This is typical for runtime loaders on many architectures. The executable object (i.e. the kernel) is placed into memory by the runtime loader, and that placement affects the addresses displayed. When you are doing static analysis with cuobjdump, the "addresses" you see are offsets relative to the base of the function. Once the code is loaded into a particular place in memory, the addresses are different: they don't start at zero the way the relative offsets do.
And, again, since the actual instructions are different, some of the offsets may differ as well. The first step towards sanity is to compile for the architecture you are running on, then use cuobjdump.