After profiling with nv-nsight-cu-cli, I encountered a performance issue: a shared memory atomicAdd on float is much more expensive than one on int.
After checking the generated SASS, I found that the SASS generated for the shared memory atomicAdd on float and on int is not similar at all.
Here is a minimal CUDA example:
$ cat test.cu
__global__ void testAtomicInt() {
__shared__ int SM_INT;
SM_INT = 0;
__syncthreads();
atomicAdd(&(SM_INT), ((int)1));
}
__global__ void testAtomicFloat() {
__shared__ float SM_FLOAT;
SM_FLOAT = 0.0;
__syncthreads();
atomicAdd(&(SM_FLOAT), ((float)1.1));
}
$ nvcc -arch=sm_86 -c test.cu
$ cuobjdump -sass test.o
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_86
Function : _Z15testAtomicFloatv
.headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ LDS R2, [RZ] ; /* 0x00000000ff027984 */
/* 0x000e240000000800 */
/*0040*/ FADD R3, R2, 1.1000000238418579102 ; /* 0x3f8ccccd02037421 */
/* 0x001fcc0000000000 */
/*0050*/ ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; /* 0x00000002ff03738d */
/* 0x000e240001800003 */
/*0060*/ ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; /* 0x000000010300780c */
/* 0x001fda0003f02070 */
/*0070*/ @!P0 BRA 0x30 ; /* 0xffffffb000008947 */
/* 0x000fea000383ffff */
/*0080*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0090*/ BRA 0x90; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*00a0*/ NOP; /* 0x0000000000007918 */
..........
Function : _Z13testAtomicIntv
.headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x000fc40000000f00 */
/*0010*/ STS [RZ], RZ ; /* 0x000000ffff007388 */
/* 0x000fe80000000800 */
/*0020*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0030*/ ATOMS.POPC.INC.32 RZ, [URZ] ; /* 0x00000000ffff7f8c */
/* 0x000fe2000d00003f */
/*0040*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0050*/ BRA 0x50; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0060*/ NOP; /* 0x0000000000007918 */
..........
Fatbin ptx code:
================
arch = sm_86
code version = [7,5]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
From the generated SASS code above, we can clearly see that the shared memory atomicAdd on int generates a single lightweight ATOMS.POPC.INC.32 RZ, [URZ], while the one on float generates a bunch of SASS around a heavyweight ATOMS.CAST.SPIN R3, [RZ], R2, R3.
The CUDA Binary Utilities document doesn't tell me the meaning of CAST or SPIN. However, I would guess it implements an exclusive spin lock on a shared memory address. (Correct me if my guess is wrong.)
In my real code, none of the SASS generated for the atomicAdd of int is a hotspot. However, this ATOMS.CAST.SPIN is significantly hotter than the other SASS generated for the atomicAdd of float.
In addition, I tested with the compiler flags -arch=sm_86, -arch=sm_80, and -arch=sm_75. Under those compute capabilities, the generated SASS for the atomicAdd of float is very similar. Another fact is that, unsurprisingly, the atomicAdd of double generates SASS much like that of float.
This observation left me with more confusion than clarity. I will ask some simple questions from my profiling experience and hope we can have a nice discussion.

1. What exactly does ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
2. Why should the atomicAdd of float generate more SASS code and do more work than the one of int? I know this is a general question and hard to answer. Maybe ATOMS.POPC.INC simply doesn't apply to the data types float or double?
3. Is it more vulnerable to shared memory load and store conflicts, and thus more stall time, for the atomicAdd of float than for the atomicAdd of int? The former clearly has more instructions to execute and divergent branches. I have the following code snippet in my project, where the number of calls to the two functions is the same. However, the atomicAdd of float builds a runtime bottleneck while the one of int doesn't.

atomicAdd(&(SM_INT), ((int)1)); // no hotspot
atomicAdd(&(SM_FLOAT), ((float)1.1)); // a hotspot
I probably won't be able to provide an answer addressing every possible question. CUDA SASS is really not documented to a level that would explain these things.
What does exactly ATOMS.CAST.SPIN do? The only SASS document I am aware of is the CUDA Binary Utilities.
ATOMS.CAST.SPIN
^^^^^ ^^^^
|   | |
|   | compare and swap
|   shared
atomic
The programming guide gives an indication of how one can implement an "arbitrary" atomic operation, using atomic CAS (Compare And Swap). You should first familiarize yourself with how atomic CAS works.
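As a refresher, the semantics of atomicCAS can be sketched as the following device function, with the important caveat that the real instruction performs all three steps as one indivisible hardware operation (this model is purely illustrative):

```cuda
// Logical model of atomicCAS on a 32-bit int (illustrative only).
// The hardware performs the read-compare-store below atomically.
__device__ int atomicCAS_model(int *address, int compare, int val)
{
    int old = *address;          // 1. read the current value
    if (old == compare)          // 2. store val only if it matches compare
        *address = val;
    return old;                  // 3. always return the value that was read
}
```

The return value is what lets the caller detect whether the swap succeeded: if the returned value equals the compare value it passed in, the store happened.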
Regarding the "arbitrary atomic" example, the thing to note is that it can evidently be used to provide atomic operations for e.g. datatypes that are not supported by a "native" atomic instruction, such as atomic add. Another thing to note is that it is essentially a loop around an atomic CAS instruction, with the loop checking to see if the operation was "successful" or not. If it was "unsuccessful", the loop continues. If it was "successful", the loop exits.
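For concreteness, here is a sketch of such a loop for a float add, modeled on the "arbitrary atomic" double example in the programming guide (the compiler emits the equivalent pattern directly in SASS; this source form is just for illustration):

```cuda
// Sketch: atomicAdd on float built from atomicCAS, following the pattern of
// the "arbitrary atomic" example in the CUDA C++ Programming Guide.
__device__ float atomicAddFloatViaCAS(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int;   // snapshot of the current value
    int assumed;
    do {
        assumed = old;
        // Compute the sum from the snapshot, then try to publish it.
        // atomicCAS stores only if the location still holds 'assumed',
        // and returns the value it actually found there.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);    // another thread got in first: retry
    return __int_as_float(old);
}
```

The do/while retry here corresponds to the ISETP/BRA back-edge to address 0x30 in the float SASS shown in the question.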
This is effectively what we see depicted in the SASS code of your float example:
/*0030*/ LDS R2, [RZ] ; // get the current value in the location
FADD R3, R2, 1.1000000238418579102 ; // perform ordinary floating-point add
ATOMS.CAST.SPIN R3, [RZ], R2, R3 ; // attempt to atomically replace the result in the location
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; // check if replacement was successful
@!P0 BRA 0x30 // if not, loop and try again
These are essentially the steps that are outlined in the "arbitrary atomic" example in the programming guide. Based on this I would conclude the following:
The compiler (most likely ptxas, but it could also be the JIT system), as a convenience feature, is automatically implementing this looping method for you, rather than throwing a compile error.
Why should the atomicAdd of float generate more SASS code and do more work than the one of int?
Evidently, the architecture you are compiling for does not have a "native" implementation of atomic add for float in shared memory, and so the compiler toolchain has chosen to implement this looping method for you. Since the loop effectively involves the possibility of success/failure, which determines whether the loop continues, and success/failure depends on other threads' behavior (contention to perform the atomic), the looping method may do considerably more "work" than a native single instruction will.
Is it more vulnerable to shared memory load and store conflicts, and thus more stall time, for the atomicAdd of float than for the atomicAdd of int?
Yes, I personally would conclude that the native atomic method is more efficient, and the looping method may be less efficient, which could be expressed in a variety of ways in the profiler, such as warp stalls.
EDIT:
It's possible for things to be implemented/available in one GPU architecture but not another. This is certainly applicable to atomics, and you can see examples of this if you read the previously linked section on atomics in the programming guide. I don't know of any architectures today that are "newer" than cc8.0 or cc8.6 (Ampere) but it is certainly possible that the behavior of a future (or any other) GPU could be different here.
This loop-around-atomicCAS method is distinct from a previous methodology (lock/update/unlock, which also involves a loop for lock negotiation) the compiler toolchain used on Kepler and prior architectures to provide atomics on shared memory when no formal SASS instructions existed to do so.