I have a simple matrix multiplication kernel running on CUDA.
When compiling using -lineinfo
command along with --ptxas-options -v
the register count is displayed as 28, whereas without the -lineinfo
option, the register count is 20.
Exact commands used:
nvcc -g -G --ptxas-options -v -arch=sm_86 -o mmul_dbg mmul.cu
and
nvcc -lineinfo --ptxas-options -v -arch=sm_86 -o mmul_ncu mmul.cu
I also checked with
nvcc --ptxas-options -v -arch=sm_86 -o mmul_dbg mmul.cu
and it yields 20 registers.
__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
// Compute each thread's global row and column index
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Iterate over row, and down column
c[row * N + col] = 0;
for (int k = 0; k < N; k++) {
// Accumulate results for a single element
c[row * N + col] += a[row * N + k] * b[k * N + col];
}
}
What could be the reason for the increased register count?
Edit: nvcc is 12.3
Edit (2): removed image and added textual output
$ nvcc --ptxas-options -v -lineinfo -o wlineinfo -arch=sm_86 m mul.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z9matrixMulPKiS0_Pii' for 'sm_86'
ptxas info : Function properties for _Z9matrixMulPKiS0_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 380 bytes cmem[0]
$ nvcc --ptxas-options -v -g -G -o wlineinfo -arch=sm_86 mmul.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z9matrixMulPKiS0_Pii' for 'sm_86'
ptxas info : Function properties for _Z9matrixMulPKiS0_Pii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 380 bytes cmem[0]
The reason for the difference is the use of the -G
switch. This selects compilation in debug mode. Once we acknowledge these statements, then we can say that the observation has nothing to do with the use of -lineinfo
.
In debug mode, many/most optimizations are disabled. One optimization the compiler may use but is disabled is loop unrolling.
In the non--G
case, the compiler implements loop unrolling. The overall number of instructions in the kernel is substantially higher, and a possible side effect of loop unrolling for performance is increased register pressure.
So due to loop unrolling in the non--G
case, the compiler has chosen a different register footprint to carry data. The character limits in the answer prevent me from providing full output for both cases, but you can get it yourself with the cuobjdump
tool. Here is a portion (first part) of the output from the unrolled/non--G
case:
# cuobjdump -sass wlineinfo
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
host = linux
compile_size = 64bit
identifier = t128.cu
code for sm_86
Function : _Z9matrixMulPKiS0_Pii
.headerflags @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_64BIT_ADDRESS EF_CUDA_SM86 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ S2R R9, SR_CTAID.Y ; /* 0x0000000000097919 */
/* 0x000e220000002600 */
/*0020*/ MOV R7, 0x4 ; /* 0x0000000400077802 */
/* 0x000fe20000000f00 */
/*0030*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe40000000a00 */
/*0040*/ S2R R0, SR_TID.Y ; /* 0x0000000000007919 */
/* 0x000e280000002200 */
/*0050*/ S2R R8, SR_CTAID.X ; /* 0x0000000000087919 */
/* 0x000e680000002500 */
/*0060*/ S2R R3, SR_TID.X ; /* 0x0000000000037919 */
/* 0x000e620000002100 */
/*0070*/ IMAD R9, R9, c[0x0][0x4], R0 ; /* 0x0000010009097a24 */
/* 0x001fe200078e0200 */
/*0080*/ MOV R0, c[0x0][0x178] ; /* 0x00005e0000007a02 */
/* 0x000fc60000000f00 */
/*0090*/ IMAD R9, R9, c[0x0][0x178], RZ ; /* 0x00005e0009097a24 */
/* 0x000fe200078e02ff */
/*00a0*/ ISETP.GE.AND P0, PT, R0, 0x1, PT ; /* 0x000000010000780c */
/* 0x000fe20003f06270 */
/*00b0*/ IMAD R8, R8, c[0x0][0x0], R3 ; /* 0x0000000008087a24 */
/* 0x002fca00078e0203 */
/*00c0*/ IADD3 R2, R8, R9, RZ ; /* 0x0000000908027210 */
/* 0x000fca0007ffe0ff */
/*00d0*/ IMAD.WIDE R2, R2, R7, c[0x0][0x170] ; /* 0x00005c0002027625 */
/* 0x000fca00078e0207 */
/*00e0*/ STG.E [R2.64], RZ ; /* 0x000000ff02007986 */
/* 0x0001e2000c101904 */
/*00f0*/ @!P0 EXIT ; /* 0x000000000000894d */
/* 0x000fea0003800000 */
/*0100*/ IADD3 R4, R0, -0x1, RZ ; /* 0xffffffff00047810 */
/* 0x000fe40007ffe0ff */
/*0110*/ MOV R15, RZ ; /* 0x000000ff000f7202 */
/* 0x000fe40000000f00 */
/*0120*/ ISETP.GE.U32.AND P0, PT, R4, 0x3, PT ; /* 0x000000030400780c */
/* 0x000fe40003f06070 */
/*0130*/ LOP3.LUT R6, R0, 0x3, RZ, 0xc0, !PT ; /* 0x0000000300067812 */
/* 0x000fe400078ec0ff */
/*0140*/ MOV R11, RZ ; /* 0x000000ff000b7202 */
/* 0x000fd20000000f00 */
/*0150*/ @!P0 BRA 0xc80 ; /* 0x00000b2000008947 */
/* 0x000fea0003800000 */
/*0160*/ IADD3 R10, -R6, c[0x0][0x178], RZ ; /* 0x00005e00060a7a10 */
/* 0x000fe20007ffe1ff */
/*0170*/ IMAD.WIDE R4, R9, R7.reuse, c[0x0][0x160] ; /* 0x0000580009047625 */
/* 0x080fe200078e0207 */
/*0180*/ MOV R15, RZ ; /* 0x000000ff000f7202 */
/* 0x000fe40000000f00 */
/*0190*/ ISETP.GT.AND P0, PT, R10, RZ, PT ; /* 0x000000ff0a00720c */
/* 0x000fe20003f04270 */
/*01a0*/ IMAD.WIDE R12, R8, R7, c[0x0][0x168] ; /* 0x00005a00080c7625 */
/* 0x000fe200078e0207 */
/*01b0*/ IADD3 R4, P1, R4, 0x8, RZ ; /* 0x0000000804047810 */
/* 0x000fe40007f3e0ff */
/*01c0*/ MOV R11, RZ ; /* 0x000000ff000b7202 */
/* 0x000fe40000000f00 */
/*01d0*/ IADD3.X R5, RZ, R5, RZ, P1, !PT ; /* 0x00000005ff057210 */
/* 0x000fce0000ffe4ff */
/*01e0*/ @!P0 BRA 0xad0 ; /* 0x000008e000008947 */
/* 0x000fea0003800000 */
/*01f0*/ ISETP.GT.AND P1, PT, R10, 0xc, PT ; /* 0x0000000c0a00780c */
/* 0x000fe40003f24270 */
/*0200*/ PLOP3.LUT P0, PT, PT, PT, PT, 0x80, 0x0 ; /* 0x000000000000781c */
/* 0x000fd60003f0f070 */
/*0210*/ @!P1 BRA 0x7a0 ; /* 0x0000058000009947 */
/* 0x000fea0003800000 */
/*0220*/ PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 ; /* 0x000000000000781c */
/* 0x000fc40003f0e170 */
/*0230*/ LDG.E R14, [R12.64] ; /* 0x000000040c0e7981 */
/* 0x000ea8000c1e1900 */
/*0240*/ LDG.E R16, [R4.64+-0x8] ; /* 0xfffff80404107981 */
/* 0x000ea4000c1e1900 */
/*0250*/ IMAD R19, R14, R16, R15 ; /* 0x000000100e137224 */
/* 0x004fe400078e020f */
/*0260*/ IMAD.WIDE R14, R0, 0x4, R12 ; /* 0x00000004000e7825 */
/* 0x008fc600078e020c */
/*0270*/ STG.E [R2.64], R19 ; /* 0x0000001302007986 */
/* 0x0003e8000c101904 */
/*0280*/ LDG.E R16, [R14.64] ; /* 0x000000040e107981 */
/* 0x000ea8000c1e1900 */
/*0290*/ LDG.E R17, [R4.64+-0x4] ; /* 0xfffffc0404117981 */
/* 0x000ea4000c1e1900 */
/*02a0*/ IMAD R21, R16, R17, R19 ; /* 0x0000001110157224 */
/* 0x004fc400078e0213 */
/*02b0*/ IMAD.WIDE R16, R0, 0x4, R14 ; /* 0x0000000400107825 */
/* 0x000fc600078e020e */
/*02c0*/ STG.E [R2.64], R21 ; /* 0x0000001502007986 */
/* 0x0005e8000c101904 */
/*02d0*/ LDG.E R18, [R16.64] ; /* 0x0000000410127981 */
/* 0x000ee8000c1e1900 */
/*02e0*/ LDG.E R12, [R4.64] ; /* 0x00000004040c7981 */
/* 0x000ee4000c1e1900 */
/*02f0*/ IMAD R23, R18, R12, R21 ; /* 0x0000000c12177224 */
/* 0x008fc400078e0215 */
/*0300*/ IMAD.WIDE R12, R0, 0x4, R16 ; /* 0x00000004000c7825 */
/* 0x000fc600078e0210 */
/*0310*/ STG.E [R2.64], R23 ; /* 0x0000001702007986 */
/* 0x0007e8000c101904 */
/*0320*/ LDG.E R18, [R12.64] ; /* 0x000000040c127981 */
/* 0x000e68000c1e1900 */
/*0330*/ LDG.E R14, [R4.64+0x4] ; /* 0x00000404040e7981 */
/* 0x000e64000c1e1900 */
/*0340*/ IMAD R19, R18, R14, R23 ; /* 0x0000000e12137224 */
/* 0x002fc400078e0217 */
/*0350*/ IMAD.WIDE R14, R0, 0x4, R12 ; /* 0x00000004000e7825 */
/* 0x000fc600078e020c */
/*0360*/ STG.E [R2.64], R19 ; /* 0x0000001302007986 */
/* 0x0003e8000c101904 */
/*0370*/ LDG.E R18, [R14.64] ; /* 0x000000040e127981 */
/* 0x000ea8000c1e1900 */
/*0380*/ LDG.E R16, [R4.64+0x8] ; /* 0x0000080404107981 */
/* 0x000ea4000c1e1900 */
/*0390*/ IMAD R21, R18, R16, R19 ; /* 0x0000001012157224 */
/* 0x004fc400078e0213 */
/*03a0*/ IMAD.WIDE R16, R0, 0x4, R14 ; /* 0x0000000400107825 */
/* 0x000fc600078e020e */
/*03b0*/ STG.E [R2.64], R21 ; /* 0x0000001502007986 */
/* 0x0005e8000c101904 */
/*03c0*/ LDG.E R18, [R16.64] ; /* 0x0000000410127981 */
/* 0x000ee8000c1e1900 */
/*03d0*/ LDG.E R12, [R4.64+0xc] ; /* 0x00000c04040c7981 */
/* 0x000ee4000c1e1900 */
/*03e0*/ IMAD R23, R18, R12, R21 ; /* 0x0000000c12177224 */
/* 0x008fc400078e0215 */
/*03f0*/ IMAD.WIDE R12, R0, 0x4, R16 ; /* 0x00000004000c7825 */
/* 0x000fc600078e0210 */
/*0400*/ STG.E [R2.64], R23 ; /* 0x0000001702007986 */
/* 0x0007e8000c101904 */
/*0410*/ LDG.E R18, [R12.64] ; /* 0x000000040c127981 */
/* 0x000e68000c1e1900 */
/*0420*/ LDG.E R14, [R4.64+0x10] ; /* 0x00001004040e7981 */
/* 0x000e64000c1e1900 */
/*0430*/ IMAD R19, R18, R14, R23 ; /* 0x0000000e12137224 */
/* 0x002fc400078e0217 */
/*0440*/ IMAD.WIDE R14, R0, 0x4, R12 ; /* 0x00000004000e7825 */
/* 0x000fc600078e020c */
/*0450*/ STG.E [R2.64], R19 ; /* 0x0000001302007986 */
/* 0x0003e8000c101904 */
/*0460*/ LDG.E R18, [R14.64] ; /* 0x000000040e127981 */
/* 0x000ea8000c1e1900 */
/*0470*/ LDG.E R16, [R4.64+0x14] ; /* 0x0000140404107981 */
/* 0x000ea4000c1e1900 */
/*0480*/ IMAD R21, R18, R16, R19 ; /* 0x0000001012157224 */
/* 0x004fc400078e0213 */
/*0490*/ IMAD.WIDE R16, R0, 0x4, R14 ; /* 0x0000000400107825 */
/* 0x000fc600078e020e */
/*04a0*/ STG.E [R2.64], R21 ; /* 0x0000001502007986 */
/* 0x0005e8000c101904 */
/*04b0*/ LDG.E R18, [R16.64] ; /* 0x0000000410127981 */
/* 0x000ee8000c1e1900 */
/*04c0*/ LDG.E R12, [R4.64+0x18] ; /* 0x00001804040c7981 */
/* 0x000ee4000c1e1900 */
/*04d0*/ IMAD R23, R18, R12, R21 ; /* 0x0000000c12177224 */
/* 0x008fc400078e0215 */
/*04e0*/ IMAD.WIDE R12, R0, 0x4, R16 ; /* 0x00000004000c7825 */
/* 0x000fc600078e0210 */
/*04f0*/ STG.E [R2.64], R23 ; /* 0x0000001702007986 */
/* 0x0007e8000c101904 */
/*0500*/ LDG.E R18, [R12.64] ; /* 0x000000040c127981 */
/* 0x000e68000c1e1900 */
/*0510*/ LDG.E R14, [R4.64+0x1c] ; /* 0x00001c04040e7981 */
/* 0x000e64000c1e1900 */
/*0520*/ IMAD R19, R18, R14, R23 ; /* 0x0000000e12137224 */
/* 0x002fc400078e0217 */
/*0530*/ IMAD.WIDE R14, R0, 0x4, R12 ; /* 0x00000004000e7825 */
/* 0x000fc600078e020c */
/*0540*/ STG.E [R2.64], R19 ; /* 0x0000001302007986 */
/* 0x000fe8000c101904 */
/*0550*/ LDG.E R18, [R14.64] ; /* 0x000000040e127981 */
/* 0x000ea8000c1e1900 */
/*0560*/ LDG.E R16, [R4.64+0x20] ; /* 0x0000200404107981 */
/* 0x000ea4000c1e1900 */
/*0570*/ IMAD R21, R18, R16, R19 ; /* 0x0000001012157224 */
/* 0x004fc400078e0213 */
/*0580*/ IMAD.WIDE R16, R0, 0x4, R14 ; /* 0x0000000400107825 */
/* 0x000fc600078e020e */
/*0590*/ STG.E [R2.64], R21 ; /* 0x0000001502007986 */
/* 0x0003e8000c101904 */
/*05a0*/ LDG.E R18, [R16.64] ; /* 0x0000000410127981 */
/* 0x000ee8000c1e1900 */
/*05b0*/ LDG.E R12, [R4.64+0x24] ; /* 0x00002404040c7981 */
/* 0x000ee4000c1e1900 */
/*05c0*/ IMAD R23, R18, R12, R21 ; /* 0x0000000c12177224 */
/* 0x008fc400078e0215 */
/*05d0*/ IMAD.WIDE R12, R0, 0x4, R16 ; /* 0x00000004000c7825 */
/* 0x000fc600078e0210 */
/*05e0*/ STG.E [R2.64], R23 ; /* 0x0000001702007986 */
/* 0x0005e8000c101904 */
/*05f0*/ LDG.E R18, [R12.64] ; /* 0x000000040c127981 */
/* 0x000ee8000c1e1900 */
/*0600*/ LDG.E R14, [R4.64+0x28] ; /* 0x00002804040e7981 */
/* 0x000ee4000c1e1900 */
/*0610*/ IMAD R25, R18, R14, R23 ; /* 0x0000000e12197224 */
/* 0x008fc400078e0217 */
/*0620*/ IMAD.WIDE R14, R0, 0x4, R12 ; /* 0x00000004000e7825 */
/* 0x000fc600078e020c */
/*0630*/ STG.E [R2.64], R25 ; /* 0x0000001902007986 */
/* 0x0007e8000c101904 */
/*0640*/ LDG.E R18, [R14.64] ; /* 0x000000040e127981 */
/* 0x000e68000c1e1900 */
/*0650*/ LDG.E R16, [R4.64+0x2c] ; /* 0x00002c0404107981 */
/* 0x000e64000c1e1900 */
/*0660*/ IMAD R21, R18, R16, R25 ; /* 0x0000001012157224 */
/* 0x002fc400078e0219 */
/*0670*/ IMAD.WIDE R16, R0, 0x4, R14 ; /* 0x0000000400107825 */
/* 0x000fc600078e020e */
/*0680*/ STG.E [R2.64], R21 ; /* 0x0000001502007986 */
/* 0x0007e8000c101904 */
/*0690*/ LDG.E R18, [R16.64] ; /* 0x0000000410127981 */
/* 0x000ea8000c1e1900 */
/*06a0*/ LDG.E R12, [R4.64+0x30] ; /* 0x00003004040c7981 */
/* 0x000ea2000c1e1900 */
/*06b0*/ IADD3 R10, R10, -0x10, RZ ; /* 0xfffffff00a0a7810 */
/* 0x000fe20007ffe0ff */
/*06c0*/ IMAD R23, R18, R12, R21 ; /* 0x0000000c12177224 */
/* 0x004fc400078e0215 */
/*06d0*/ IMAD.WIDE R18, R0, 0x4, R16 ; /* 0x0000000400127825 */
/* 0x000fc600078e0210 */
/*06e0*/ STG.E [R2.64], R23 ; /* 0x0000001702007986 */
/* 0x0007e8000c101904 */
/*06f0*/ LDG.E R12, [R18.64] ; /* 0x00000004120c7981 */
/* 0x000ea8000c1e1900 */
/*0700*/ LDG.E R15, [R4.64+0x34] ; /* 0x00003404040f7981 */
/* 0x0002a2000c1e1900 */
/*0710*/ ISETP.GT.AND P1, PT, R10, 0xc, PT ; /* 0x0000000c0a00780c */
/* 0x000fe40003f24270 */
/*0720*/ IADD3 R14, P2, R4, 0x40, RZ ; /* 0x00000040040e7810 */
/* 0x000fc40007f5e0ff */
/*0730*/ IADD3 R11, R11, 0x10, RZ ; /* 0x000000100b0b7810 */
/* 0x000fe40007ffe0ff */
/*0740*/ IADD3.X R5, RZ, R5, RZ, P2, !PT ; /* 0x00000005ff057210 */
/* 0x002fe400017fe4ff */
/*0750*/ MOV R4, R14 ; /* 0x0000000e00047202 */
/* 0x000fe20000000f00 */
/*0760*/ IMAD R15, R12, R15, R23 ; /* 0x0000000f0c0f7224 */
/* 0x004fe400078e0217 */
/*0770*/ IMAD.WIDE R12, R0, 0x4, R18 ; /* 0x00000004000c7825 */
/* 0x000fc600078e0212 */
/*0780*/ STG.E [R2.64], R15 ; /* 0x0000000f02007986 */
/* 0x0007e2000c101904 */
/*0790*/ @P1 BRA 0x230 ; /* 0xfffffa9000001947 */
At the tail end of the above listing, you will find a sequence of instructions that repeats, roughly like this:
LDG // load A element
LDG // load B element
IMAD // 64-bit
IMAD // integer multiply of A and B
STG // store C element
That repeating sequence represents the unrolled loop body. If you use the cuobjdump
tool to study the -G
code, you will find: 1. a fewer number of instructions overall, 2. no repeating sequence as indicated above.
I acknowledge this answer does not provide a detailed, precise description of the reason for the increased register use in the optimized case. That would require more careful analysis and counting, as well as probably some conjecture about compiler behavior.
Loop unrolling by itself does not necessarily/automatically imply increased register usage, but the two are often related.