Tags: caching, memory, cuda, gpu

Load/Store caching on NVIDIA GPUs


I have a question about the book "Professional CUDA C Programming".

It says the following about the GPU cache:

On the CPU, both memory loads and stores can be cached. However, on the GPU only memory load operations can be cached; memory store operations cannot be cached. [p142]

But on another page, it says:

Global memory loads/stores are staged through caches. [p158]

I'm really confused about whether the GPU caches stores or not.

If the first quote is correct, I understand it to mean that the GPU does not cache writes (modifications of data).

Writes would thus go directly to global memory (DRAM).

Also, is this similar to the "no-write allocate" policy on CPUs?

I'd appreciate a clear explanation. Thanks!


Solution

  • Even the ancient Fermi architecture (compute capability 2.x) cached stores in L2:

    Fermi features a 768 KB unified L2 cache that services all load, store, and texture requests.
    NVIDIA Fermi Compute Architecture Whitepaper (PDF) (emphasis mine)

    So the book seems to be talking about write-caching in the L1 data cache specifically.

    The short answer regarding write-caching in L1 is that, since the Volta architecture (compute capability 7.0, which is newer than the "Professional CUDA C Programming" book), stores can certainly be cached in L1:

    The GV100 L1 cache improves performance in a variety of situations where shared memory is not the best choice or is unable to work. With Volta GV100, the merging of shared memory and L1 delivers a high-speed path to global memory capable of streaming access with unlimited cache misses in flight. Prior NVIDIA GPUs only performed load caching, while GV100 introduces write-caching (caching of store operations) to further improve performance.
    Volta Architecture Whitepaper (PDF) (emphasis mine)

    Like Volta, Turing’s L1 can cache write operations (write-through). The result is that for many applications Volta and Turing narrow the performance gap between explicitly managed shared memory and direct access to device memory.
    Turing Tuning Guide from CUDA Toolkit Documentation 12.6 Update 3 (emphasis mine)

    But there is inconsistent information in the Programming Guide:

    Compute Capability 5.x - Global Memory

    [...] Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. [...]

    Compute Capability 7.x - Global Memory

    Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory [section for C.C. 5.x]).
    CUDA C++ Programming Guide from CUDA Toolkit Documentation 12.6 Update 3

    This seems to be the cause of a lot of confusion. I will report this discrepancy between the Programming Guide and the Tuning Guides as a bug and report back.

    All architectures since Volta/Turing feature the same on-chip unified data cache architecture. There is nothing to be found in their tuning guides regarding what is cached in L1, so I assume that these newer architectures (Ampere, Ada, Hopper and Blackwell) behave the same way as Volta/Turing with regards to caching global loads and stores in L1.

    Understanding the zoo of caches and their names and behaviors can be hard without knowing their history, so I have compiled the following overview:

    A History of Caching for Nvidia Architectures

    Shorthands: SM = streaming multiprocessor, L1/L2 = level 1/level 2 data caches, TEX = texture cache, ROC = read-only constant cache, ROD = read-only data cache, LD/ST = load/store, LDU = load uniform (read-only global load through the ROC), LDG = non-coherent/read-only global load through the ROD/TEX.

    For context, pre-Volta architectures usually did not even cache global loads in the L1 cache. Instead, since Kepler, one could use the on-chip ROD for some read-caching of global memory via LDG. If I interpret the old docs correctly, this replaced the Fermi feature of uniform read-only global loads (LDU) going through the ROC.

    It seems the reason for the decision to only cache read-only data by default was that developers were expected to manually cache data in shared memory whenever it was accessed more than once. In contrast to shared memory, global memory was seen as the way to communicate between thread-blocks and therefore between SMs. In this context it makes sense not to cache writable global memory on the SM, as doing so could cause bugs due to stale cache lines: there was and is no coherency mechanism for L1.
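
    As a minimal sketch of what avoiding stale L1 data can look like in practice (the kernel, the names and the simplified one-producer/one-consumer setup are illustrative, not taken from the book or the quoted documents): one block writes a payload and raises a flag, the flag is declared volatile so that every access compiles to an actual memory instruction rather than being served from a potentially stale cached value, and __threadfence() orders the payload write before the flag write. The sketch assumes both blocks are resident on the GPU at the same time.

    ```cuda
    #include <cstdio>

    // Block 0 writes a payload and raises a flag; block 1 spins on the flag.
    __global__ void inter_block_handshake(int* payload, volatile int* flag)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *payload = 42;
            __threadfence();  // make the payload visible device-wide before the flag
            *flag = 1;        // volatile store: always an actual memory instruction
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (*flag == 0) { }  // volatile load: re-read on every iteration
            __threadfence();
            printf("consumer read %d\n", *payload);
        }
    }

    int main()
    {
        int* buf;
        cudaMalloc(&buf, 2 * sizeof(int));
        cudaMemset(buf, 0, 2 * sizeof(int));
        inter_block_handshake<<<2, 1>>>(buf, buf + 1);  // payload = buf[0], flag = buf[1]
        cudaDeviceSynchronize();
        cudaFree(buf);
        return 0;
    }
    ```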

    In its most feature-bare form on Maxwell (which did not even cache local memory), the unified L1/TEX only staged writable memory for coalescing, which corresponds to the OP's second quote from the "Professional CUDA C Programming" book, at least for loads:

    The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
    Maxwell Tuning Guide from CUDA Toolkit Documentation 12.6 Update 3

    I.e. staging data through L1 does not mean making it available to any future load from the same address. I imagine this as cache lines being invalidated as soon as they have been consumed by the SM.

    When designing the Volta architecture with its (re-) unified L1/TEX + shared memory, Nvidia seemingly put a higher priority on the ability to do fast prototyping of CUDA kernels and get passable performance without manually caching in shared memory. One can argue that communication between SMs via global memory is a rather advanced technique and developers using it should know how to avoid reading stale data from L1.

    | Architecture               | Tesla | Fermi | Kepler | Maxwell | Pascal | Pascal | Volta and newer |
    |----------------------------|-------|-------|--------|---------|--------|--------|-----------------|
    | Compute Capability         | 1.x   | 2.x   | 3.x    | 5.x     | 6.0    | 6.y    | 7.0 - 12.0      |
    | unified L1 + shared memory |       | ✅    | ✅     |         |        |        | ✅              |
    | unified L1 + ROD = L1/TEX  |       |       |        | ✅      | ✅     | ✅     | ✅              |
    | local LD/ST -> L1          |       | ✅    | ✅     |         |        |        | ✅              |
    | constant LD -> ROC         | ✅    | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |
    | texture LD -> TEX          | ✅    | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |
    | global LDU -> ROC          |       | ✅    |        |         |        |        |                 |
    | global LDG -> ROD/TEX      |       |       | ✅     | ✅      | ✅     | ✅     | ✅              |
    | global LD -> L1            |       | ✅ ¹  | (✅) ² | (✅) ²  | ✅     | (✅) ² | ✅              |
    | global ST -> L1            |       |       |        |         |        |        | ✅              |
    | all LD/ST -> L2            |       | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |

    1: Can be disabled via -Xptxas -dlcm=cg (See CUDA C Programming Guide 4.0 (PDF, on Hunter College website): F.4.2).
    2: Can be enabled via -Xptxas -dlcm=ca for certain GPUs (Kepler GK110B, GK20A, GK210, Maxwell GM204 and Pascal GP104).
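
    As a concrete (hypothetical) illustration of how these flags are passed, assuming a source file kernel.cu with an ordinary global load:

    ```cuda
    // kernel.cu -- the -dlcm option is forwarded to ptxas and sets the *default*
    // cache operator used for global loads that don't specify one explicitly:
    //
    //   nvcc -arch=sm_20 -Xptxas -dlcm=cg kernel.cu   // Fermi: cache global loads in L2 only
    //   nvcc -arch=sm_35 -Xptxas -dlcm=ca kernel.cu   // e.g. Kepler GK110B: opt in to L1 load caching
    __global__ void axpy(float a, const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];  // these global loads use the default cache operator set by -dlcm
    }
    ```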

    Looking at the assembly generated for LDG (e.g. via the __ldg() intrinsic) on Volta and newer, it results in the .nc (non-coherent) modifier on the ld.global instruction in PTX, which is still reflected in SASS as well (as .CONSTANT, which has nothing to do with constant memory; constant memory has its own SASS load instructions). So for some reason LDG is still handled differently from an ordinary global load, even though both should go through the same unified L1/TEX cache, which, unlike the partitioned-off shared memory, does not feature a (documented) separation between its L1 and TEX capabilities. Maybe this is just about priority/cache-eviction policy?
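
    One (illustrative) way to observe this is to compile a small kernel that uses the __ldg() intrinsic to PTX and look for the .nc modifier; file and kernel names below are placeholders:

    ```cuda
    // ldg_demo.cu -- inspect the PTX with: nvcc -arch=sm_70 -ptx ldg_demo.cu
    __global__ void copy_ldg(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float a = in[i];          // ordinary global load: ld.global.f32
            float b = __ldg(&in[i]);  // read-only/non-coherent load: ld.global.nc.f32
            out[i] = a + b;
        }
    }
    ```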

    Cache Operators/Hints

    For a deeper dive, take a look at the PTX ISA's "Cache Operators for Memory Store Instructions" (also available as CUDA C++ intrinsics called "Store Functions Using Cache Hints"):

    | Operator | Meaning |
    |----------|---------|
    | .wb | Cache write-back all coherent levels. The default store instruction cache operation is st.wb, which writes back cache lines of coherent cache levels with normal eviction policy. [...] |
    | .cg | Cache at global level (cache in L2 and below, not L1). Use st.cg to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache. |
    | .cs | Cache streaming, likely to be accessed once. The st.cs store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data. |
    | .wt | Cache write-through (to system memory). The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache. |

    Parallel Thread Execution ISA Version 8.5 from CUDA Toolkit Documentation 12.6 Update 3
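
    These operators can be used from CUDA C++ without writing inline PTX via the store intrinsics mentioned above (__stwb(), __stcg(), __stcs(), __stwt()). A small illustrative kernel (names and scenario are mine, not from the PTX ISA) might look like this:

    ```cuda
    __global__ void store_with_hints(float* out, const float* in, int* flag, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = 2.0f * in[i];
            __stcs(&out[i], v);    // st.global.cs: streaming output, evict-first to limit cache pollution
            // __stwb(&out[i], v); // st.global.wb: default write-back behavior
            // __stcg(&out[i], v); // st.global.cg: bypass L1, cache only in L2
        }
        if (i == 0) {
            __stwt(flag, 1);  // st.global.wt: write through the L2 cache, intended for
                              // System Memory addresses (e.g. mapped/pinned host memory)
        }
    }
    ```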


    Edit: After finding the original version of this table in the PTX ISA for Fermi and reading the additional Fermi-specific information in that version, I now know that the distinction between st.wb/st.wt and st.cg was introduced primarily for st.local, not for st.global:

    | Operator | Meaning |
    |----------|---------|
    | .wb | [...] Data stored to local per-thread memory is cached in L1 and L2 with write-back. However, sm_20 does NOT cache global store data in L1 because multiple L1 caches are not coherent for global data. Global stores bypass L1, and discard any L1 lines that match, regardless of the cache operation. Future GPUs may have globally-coherent L1 caches, in which case st.wb could write-back global store data from L1. |
    | .cg | [...] In sm_20, st.cg is the same as st.wb for global data, but st.cg to local memory uses the L1 cache, and marks local L1 lines evict-first. |
    | .cs | [...] |
    | .wt | [...] The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache, to allow a CPU program to poll a SysMem location written by the GPU with st.wt. Addresses not in System Memory use normal write-back. |

    Parallel Thread Execution ISA Version 2.3 (PDF, on Drexel University website) originally from CUDA Toolkit 4.0

    Due to the thread-private nature of local memory, the per-SM L1 cache is coherent for local accesses. While my previous interpretation (see below) might still be correct for global memory access on Volta and newer GPUs, my arguments were flawed, so take the following with a grain of salt.


    Previous Interpretation of Cache Operators/Hints Considering Only st.global but Not st.local

    This table is written confusingly (maybe to avoid describing architecture-specific behavior), but given the information we already have about L1 write-caching, the best interpretation I can come up with is that st.wb and st.wt only concern how the write is handled by L2, while L1 write-caching is left up to the particular architecture, as L1 is not a coherent level and probably does not contain the logic necessary to implement write-back. Since the description of st.wb does not concern the handling in non-coherent levels at all, this is fine.

    One can think of L1 write-caching as always being write-through (i.e. eagerly writing to the next level), but with invalidation of the L1 cache line on pre-Volta architectures, which is not what one typically thinks of as "write-through" but should still be fine.

    st.cg explicitly disallows caching in L1, i.e. it should always reproduce the behavior of pre-Volta architectures. And st.cs does not mention the cache levels at all and just determines the eviction policy. This interpretation agrees with the one given at Making better sense of the PTX store caching modes (assuming a Volta or later architecture).

    So stores are always cached in L2, while L1 write-caching depends on the GPU architecture and the actual store instruction used.