sass, cuda, atomic, gpu-shared-memory

Shared memory atomicAdd with int and float has different SASS


I encountered a performance issue: after profiling with nv-nsight-cu-cli, I found that shared memory atomicAdd on float is much more expensive than on int. After checking the generated SASS, I found that the SASS generated for shared memory atomicAdd on float and on int is not similar at all.

Here is a minimal CUDA example:

$ cat test.cu                                                                                                                                                                                                                                                   
__global__ void testAtomicInt() {
    __shared__ int SM_INT;
    SM_INT = 0;
    __syncthreads();
    atomicAdd(&(SM_INT), ((int)1));
}

__global__ void testAtomicFloat() {
    __shared__ float SM_FLOAT;
    SM_FLOAT = 0.0;
    __syncthreads();
    atomicAdd(&(SM_FLOAT), ((float)1.1));
}

$ nvcc -arch=sm_86 -c test.cu 
$ cuobjdump -sass test.o                                                                                                                                                                                                                                        
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

    code for sm_86
        Function : _Z15testAtomicFloatv
    .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;                  /* 0x00000a0000017a02 */
                                                                           /* 0x000fc40000000f00 */
        /*0010*/                   STS [RZ], RZ ;                          /* 0x000000ffff007388 */
                                                                           /* 0x000fe80000000800 */
        /*0020*/                   BAR.SYNC 0x0 ;                          /* 0x0000000000007b1d */
                                                                           /* 0x000fec0000000000 */
        /*0030*/                   LDS R2, [RZ] ;                          /* 0x00000000ff027984 */
                                                                           /* 0x000e240000000800 */
        /*0040*/                   FADD R3, R2, 1.1000000238418579102 ;    /* 0x3f8ccccd02037421 */
                                                                           /* 0x001fcc0000000000 */
        /*0050*/                   ATOMS.CAST.SPIN R3, [RZ], R2, R3 ;      /* 0x00000002ff03738d */
                                                                           /* 0x000e240001800003 */
        /*0060*/                   ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ;  /* 0x000000010300780c */
                                                                           /* 0x001fda0003f02070 */
        /*0070*/              @!P0 BRA 0x30 ;                              /* 0xffffffb000008947 */
                                                                           /* 0x000fea000383ffff */
        /*0080*/                   EXIT ;                                  /* 0x000000000000794d */
                                                                           /* 0x000fea0003800000 */
        /*0090*/                   BRA 0x90;                               /* 0xfffffff000007947 */
                                                                           /* 0x000fc0000383ffff */
        /*00a0*/                   NOP;                                    /* 0x0000000000007918 */
        ..........


        Function : _Z13testAtomicIntv
    .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;         /* 0x00000a0000017a02 */
                                                                  /* 0x000fc40000000f00 */
        /*0010*/                   STS [RZ], RZ ;                 /* 0x000000ffff007388 */
                                                                  /* 0x000fe80000000800 */
        /*0020*/                   BAR.SYNC 0x0 ;                 /* 0x0000000000007b1d */
                                                                  /* 0x000fec0000000000 */
        /*0030*/                   ATOMS.POPC.INC.32 RZ, [URZ] ;  /* 0x00000000ffff7f8c */
                                                                  /* 0x000fe2000d00003f */
        /*0040*/                   EXIT ;                         /* 0x000000000000794d */
                                                                  /* 0x000fea0003800000 */
        /*0050*/                   BRA 0x50;                      /* 0xfffffff000007947 */
                                                                  /* 0x000fc0000383ffff */
        /*0060*/                   NOP;                           /* 0x0000000000007918 */
        ..........



Fatbin ptx code:
================
arch = sm_86
code version = [7,5]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
       

From the generated SASS code above, we can clearly see that shared memory atomicAdd on int generates a single lightweight ATOMS.POPC.INC.32 RZ, [URZ], while on float it generates a bunch of SASS around a heavyweight ATOMS.CAST.SPIN R3, [RZ], R2, R3.

The CUDA Binary Utilities documentation doesn't tell me the meaning of CAST or SPIN. However, I would guess it implements an exclusive spin lock on a shared memory address. (Correct me if my guess is wrong.) In my real code, none of the SASS generated for atomicAdd on int is a hotspot. However, this ATOMS.CAST.SPIN is significantly hotter than the other SASS generated for atomicAdd on float.

In addition, I tested with the compiler flags -arch=sm_86, -arch=sm_80, and -arch=sm_75. Under those compute capabilities, the SASS generated for atomicAdd on float is very similar. Another fact, unsurprisingly, is that atomicAdd on double generates SASS much like that of float.


This observation raised more questions than it answered. I will start with some simple ones from my profiling experience and hope we can have a good discussion.

atomicAdd(&(SM_INT), ((int)1));  // no hotspot
atomicAdd(&(SM_FLOAT), ((float)1.1)); // a hotspot

Solution

  • I probably won't be able to provide an answer addressing every possible question. CUDA SASS is really not documented to a level that would explain these things.

    What exactly does ATOMS.CAST.SPIN do? The only SASS documentation I am aware of is the CUDA Binary Utilities.

    ATOMS.CAST.SPIN
    ^^^^^ ^^^^
       ||    |
       ||    compare and swap
       |shared
       atomic
    

    The programming guide gives an indication of how one can implement an "arbitrary" atomic operation, using atomic CAS (Compare And Swap). You should first familiarize yourself with how atomic CAS works.

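    As a quick refresher on the semantics, atomicCAS on a 32-bit word behaves roughly as sketched below (the helper name cas_semantics is made up; the real atomicCAS performs these steps as one indivisible hardware operation):

    // Sketch of the semantics of atomicCAS(address, compare, val) on a 32-bit
    // word. The name cas_semantics is illustrative only; the real atomicCAS
    // does this read-compare-write as a single indivisible operation.
    __device__ unsigned int cas_semantics(unsigned int *address,
                                          unsigned int compare,
                                          unsigned int val)
    {
        unsigned int old = *address;   // read the current value
        if (old == compare)
            *address = val;            // store the new value only if unchanged
        return old;                    // caller checks old == compare for success
    }
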
    Regarding the "arbitrary atomic" example, the thing to note is that it can evidently be used to provide atomic operations for e.g. datatypes that are not supported by a "native" atomic instruction, such as atomic add. Another thing to note is that it is essentially a loop around an atomic CAS instruction, with the loop checking to see if the operation was "successful" or not. If it was "unsuccessful", the loop continues. If it was "successful", the loop exits.

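    To make that loop concrete, here is a minimal sketch of a CAS-based float add in CUDA C++, in the spirit of the programming guide's "arbitrary atomic" example (the helper name myAtomicAddFloat is made up for illustration; source code can of course just call atomicAdd, as in your example):

    // Hedged sketch of a loop-around-atomicCAS float add; not necessarily the
    // code the compiler actually emits, just the pattern it follows.
    __device__ float myAtomicAddFloat(float *address, float val)
    {
        unsigned int *address_as_uint = (unsigned int *)address;
        unsigned int old = *address_as_uint;
        unsigned int assumed;

        do {
            assumed = old;
            // Compute the new value in registers, then try to swap it in only
            // if the location still holds the value we assumed.
            old = atomicCAS(address_as_uint, assumed,
                            __float_as_uint(__uint_as_float(assumed) + val));
            // If another thread changed the location first, atomicCAS returns
            // a different value and the loop retries ("spins").
        } while (assumed != old);

        return __uint_as_float(old);
    }
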
    This is effectively what we see depicted in SASS code in your float example:

    /*0030*/  LDS R2, [RZ] ;  // get the current value in the location
    FADD R3, R2, 1.1000000238418579102 ; // perform ordinary floating-point add
    ATOMS.CAST.SPIN R3, [RZ], R2, R3 ;  //  attempt to atomically replace the result in the location
    ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT ; // check if replacement was successful
    @!P0 BRA 0x30   // if not, loop and try again
    

    These are essentially the steps that are outlined in the "arbitrary atomic" example in the programming guide. Based on this I would conclude the following:

    Why does atomicAdd on float generate more SASS code and do more work than atomicAdd on int?

    Evidently, the architecture you are compiling for does not have a "native" implementation of atomic add for float, so the compiler toolchain has chosen to implement this looping method for you. Since the loop involves the possibility of success or failure, which determines whether the loop continues, and success or failure depends on other threads' behavior (contention to perform the atomic), the looping method may do considerably more "work" than a single native instruction would.

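    As a hedged illustration of that contention point, the kernel below (the name contentionDemo and its layout are made up) makes every thread in a block add to the same shared float, and also to a per-warp location; with the looping method, the single hot location can force many more retries:

    // Illustrative only: high contention (all threads hit one shared float)
    // versus lower contention (threads spread across per-warp locations).
    __global__ void contentionDemo(float *out)
    {
        __shared__ float hot;          // single location every thread updates
        __shared__ float perWarp[32];  // one location per warp

        if (threadIdx.x == 0) hot = 0.0f;
        if (threadIdx.x < 32) perWarp[threadIdx.x] = 0.0f;
        __syncthreads();

        atomicAdd(&hot, 1.1f);                        // every thread contends here
        atomicAdd(&perWarp[threadIdx.x / 32], 1.1f);  // contention spread out

        __syncthreads();
        if (threadIdx.x == 0) out[blockIdx.x] = hot + perWarp[0];
    }
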
    Is atomicAdd on float more vulnerable to shared memory load and store conflicts, and thus more stall time, than atomicAdd on int?

    Yes, I personally would conclude that the native atomic method is more efficient, and the looping method may be less efficient, which could be expressed in a variety of ways in the profiler, such as warp stalls.

    EDIT:

    1. It's possible for things to be implemented/available in one GPU architecture but not another. This is certainly applicable to atomics, and you can see examples of this if you read the previously linked section on atomics in the programming guide. I don't know of any architectures today that are "newer" than cc8.0 or cc8.6 (Ampere) but it is certainly possible that the behavior of a future (or any other) GPU could be different here.

    2. This loop-around-atomicCAS method is distinct from a previous methodology (lock/update/unlock, which also involves a loop for lock negotiation) that the compiler toolchain used on Kepler and prior architectures to provide shared memory atomics when no formal SASS instructions existed to do so; a rough sketch of that older pattern follows.
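
    The general shape of that older approach, sketched in CUDA C++ terms (not the code the old toolchain actually emitted; the lock word and the helper name lockedAddFloat are made up for illustration):

    // Hedged sketch of the lock/update/unlock pattern: negotiate a lock,
    // perform an ordinary update, then release the lock. A real implementation
    // must also handle intra-warp lock negotiation carefully to avoid hangs.
    __device__ void lockedAddFloat(float *address, float val, int *lock)
    {
        bool done = false;
        while (!done) {
            // Try to acquire the lock: atomicCAS returns 0 only if this
            // thread flipped the lock word from 0 (free) to 1 (held).
            if (atomicCAS(lock, 0, 1) == 0) {
                *address += val;          // ordinary, non-atomic update
                __threadfence_block();    // make the update visible to the block
                atomicExch(lock, 0);      // release the lock
                done = true;
            }
        }
    }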