cuda

What's CGA in the CUDA programming model?


Hi, I understand CTA, which stands for cooperative thread array. But what is a CGA, and what is the relationship between a CTA and a CGA? I can't find a document that explains these well.


Solution

  • CGA is a new addition to cooperative groups in the Hopper architecture.

    To disambiguate:

    New thread block cluster feature enables programmatic control of locality at a granularity larger than a single thread block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy to now include threads, thread blocks, thread block clusters, and grids. Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.

    Annoyingly, they call it a thread block cluster here rather than a cooperative grid array. They also used the word cluster in a different context in the Ampere docs.

    All these groupings are implemented using the cooperative groups API.
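
    As a minimal sketch (my own illustration; the kernel name, cluster shape, and output buffer are placeholders), a Hopper (sm_90) kernel can declare a compile-time cluster shape with the __cluster_dims__ attribute and get at the cluster through the cooperative groups API:

    ```cuda
    // Sketch: a kernel launched with clusters of 2 thread blocks (requires sm_90).
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *out)
    {
        cg::cluster_group cluster = cg::this_cluster();

        // Each block records its rank within its cluster (0 or 1 here).
        if (threadIdx.x == 0)
            out[blockIdx.x] = static_cast<float>(cluster.block_rank());

        // Barrier across every thread block in the cluster, not just this one block.
        cluster.sync();
    }
    ```

    The cluster shape can also be chosen at launch time instead, via cudaLaunchKernelEx with the cudaLaunchAttributeClusterDimension attribute.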

    See: Cooperative groups in CUDA
    The documentation is here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups

    And a blog post is here: https://developer.nvidia.com/blog/cooperative-groups/

    CGA specifically is mentioned in the technical blog post on the new CUDA 12 features (for Hopper), namely:

    Support for C intrinsics for cooperative grid array (CGA) relaxed barriers

    These are documented here in the CUDA programming guide as:

    barrier_arrive and barrier_wait member functions were added for grid_group and thread_block. Description of the API is available here.
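
    A rough sketch of what that split (arrive/wait) barrier looks like on a thread_block (my own example; the kernel name and the 256-thread shared buffer are assumptions):

    ```cuda
    // Sketch: barrier_arrive / barrier_wait let a thread signal arrival early,
    // overlap independent work, and only block when it actually needs the result.
    #include <cooperative_groups.h>
    #include <cuda/std/utility>   // cuda::std::move is usable in device code
    namespace cg = cooperative_groups;

    __global__ void split_barrier_kernel(int *out)
    {
        __shared__ int smem[256];                   // assumes blockDim.x <= 256
        cg::thread_block block = cg::this_thread_block();
        unsigned int tid = block.thread_rank();

        smem[tid] = static_cast<int>(tid);          // produce data other threads will read

        auto token = block.barrier_arrive();        // signal arrival, but don't block yet

        // ... independent work that does not touch smem can overlap here ...

        block.barrier_wait(cuda::std::move(token)); // now wait for the whole block
        out[blockIdx.x * blockDim.x + tid] = smem[(tid + 1) % block.num_threads()];
    }
    ```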

    These barriers are a big deal, because they are how threads synchronize, which is vital if they are to cooperate harmoniously. We could always do this through global memory, but that has to go through the L2 cache (200 cycles) or even through main memory (500 cycles). On Hopper it can instead go over a dedicated SM-to-SM network between the shared memories (which are an L1-like cache), hence much faster communication.

    This innovation is enabled as follows. On Ampere and earlier, each thread block has its own private shared memory area on the SM where it runs. On Hopper, however, a thread block can also access the shared memory of the other blocks in its cluster, even when they are running on different SMs; NVIDIA calls this distributed shared memory. This allows for very efficient inter-block communication across SMs, and the new barriers are implemented using this fast access to shared memory between blocks in the cluster.
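
    For example (a sketch of my own; the kernel name, the cluster size of 4, and the exchange pattern are made up), each block can publish a value in its own shared memory and then read the neighbouring block's value through cluster.map_shared_rank:

    ```cuda
    // Sketch: read a peer block's shared memory within a cluster (requires sm_90).
    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void __cluster_dims__(4, 1, 1) dsm_kernel(int *out)
    {
        __shared__ int my_value;
        cg::cluster_group cluster = cg::this_cluster();
        unsigned int rank = cluster.block_rank();

        if (threadIdx.x == 0)
            my_value = static_cast<int>(rank);  // publish in this block's shared memory

        cluster.sync();                         // make every block's write visible cluster-wide

        // Map the same variable, but in the next block of the cluster.
        unsigned int peer = (rank + 1) % cluster.num_blocks();
        int *peer_value = cluster.map_shared_rank(&my_value, peer);

        if (threadIdx.x == 0)
            out[blockIdx.x] = *peer_value;      // read directly from the other SM's shared memory

        cluster.sync();                         // don't exit while a peer may still be reading us
    }
    ```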

    This mechanism for accessing another block's shared memory is not well documented in the CUDA programming guide, but it is detailed in the PTX ISA documentation. In the CUDA programming guide it is buried in the memcpy_async section.

    *) On Volta and later, threads in a warp can diverge, but doing so is computationally expensive.