Tags: caching, memory, cuda, gpu

Load/Store caching on NVIDIA GPUs


I have a question about the book "Professional CUDA C Programming".

It says the following about the GPU cache:

On the CPU, both memory loads and stores can be cached. However, on the GPU only memory load operations can be cached; memory store operations cannot be cached. [p142]

But on another page, it says:

Global memory loads/stores are staged through caches. [p158]

I'm really confused about whether the GPU caches stores or not.

If the first quote is correct, I understand it to mean that the GPU does not cache writes (modifications of data).

Writes would thus go directly to global memory (DRAM).

Also, is this similar to the "no-write allocate" policy on CPUs?

I'd appreciate a clear explanation. Thanks!


Solution

  • Even the ancient Fermi architecture (compute capability 2.x) cached stores in L2:

    Fermi features a 768 KB unified L2 cache that services all load, store, and texture requests.
    NVIDIA Fermi Compute Architecture Whitepaper (PDF) (emphasis mine)

    So the book seems to be talking about write-caching in the L1 data cache specifically.

    The short answer regarding write-caching in L1 is that, since the Volta architecture (compute capability 7.0, which is newer than the "Professional CUDA C Programming" book), stores can certainly be cached in L1:

    The GV100 L1 cache improves performance in a variety of situations where shared memory is not the best choice or is unable to work. With Volta GV100, the merging of shared memory and L1 delivers a high-speed path to global memory capable of streaming access with unlimited cache misses in flight. Prior NVIDIA GPUs only performed load caching, while GV100 introduces write-caching (caching of store operations) to further improve performance.
    Volta Architecture Whitepaper (PDF) (emphasis mine)

    Like Volta, Turing’s L1 can cache write operations (write-through). The result is that for many applications Volta and Turing narrow the performance gap between explicitly managed shared memory and direct access to device memory.
    Turing Tuning Guide from CUDA Toolkit Documentation 12.6 Update 3 (emphasis mine)

    But there is inconsistent information in the Programming Guide:

    Compute Capability 5.x - Global Memory

    [...] Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. [...]

    Compute Capability 7.x - Global Memory

    Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory [section for C.C. 5.x]).
    CUDA C++ Programming Guide from CUDA Toolkit Documentation 12.6 Update 3

    This seems to be the cause of a lot of confusion. I will report this discrepancy between the Programming Guide and the Tuning Guides as a bug and report back.

    All architectures since Volta/Turing feature the same on-chip unified data cache architecture. There is nothing to be found in their tuning guides regarding what is cached in L1, so I assume that these newer architectures (Ampere, Ada, Hopper and Blackwell) behave the same way as Volta/Turing with regards to caching global loads and stores in L1.

    Understanding the zoo of caches and their names and behaviors can be hard without knowing their history, so I have compiled the following overview:

    A History of Caching for Nvidia Architectures

    Shorthands: SM = streaming multiprocessor, L1/L2 = level 1/level 2 data caches, TEX = texture cache, ROC = read-only constant cache, ROD = read-only data cache, LD/ST = load/store, LDU = load uniform (read-only global load through the ROC), LDG = non-coherent/read-only global load through the ROD/TEX.

    For context, pre-Volta architectures usually did not even cache global loads in the L1 cache. Instead, since Kepler, one could use the on-chip ROD for some read-caching of global memory via LDG. If I interpret the old docs correctly, this replaced the Fermi feature of uniform read-only global loads (LDU) going through the ROC.

    It seems the reason for the decision to only cache read-only data by default was that developers were expected to manually cache data in shared memory whenever it was accessed more than once. In contrast to shared memory, global memory was seen as the way to communicate between thread-blocks and therefore between SMs. In this context it makes sense not to cache writable global memory on the SM, as doing so could cause bugs due to stale cache lines: there was and is no coherency mechanism for L1.
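
    As a minimal sketch of what avoiding stale L1 data can look like in practice (the kernel, the names and the simplified one-producer/one-consumer setup are illustrative, not taken from the book or the quoted documents): one block writes a payload and raises a flag, the flag is declared volatile so that every access compiles to an actual memory instruction rather than being served from a potentially stale cached value, and __threadfence() orders the payload write before the flag write. The sketch assumes both blocks are resident on the GPU at the same time.

    ```cuda
    #include <cstdio>

    // Block 0 writes a payload and raises a flag; block 1 spins on the flag.
    __global__ void inter_block_handshake(int* payload, volatile int* flag)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *payload = 42;
            __threadfence();  // make the payload visible device-wide before the flag
            *flag = 1;        // volatile store: always an actual memory instruction
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (*flag == 0) { }  // volatile load: re-read on every iteration
            __threadfence();
            printf("consumer read %d\n", *payload);
        }
    }

    int main()
    {
        int* buf;
        cudaMalloc(&buf, 2 * sizeof(int));
        cudaMemset(buf, 0, 2 * sizeof(int));
        inter_block_handshake<<<2, 1>>>(buf, buf + 1);  // payload = buf[0], flag = buf[1]
        cudaDeviceSynchronize();
        cudaFree(buf);
        return 0;
    }
    ```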

    In its most feature-bare form on Maxwell (which did not even cache local memory), the unified L1/TEX only staged writable memory for coalescing, which corresponds to the OP's second quote from the "Professional CUDA C Programming" book, at least for loads:

    The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
    Maxwell Tuning Guide from CUDA Toolkit Documentation 12.6 Update 3

    I.e. staging data through L1 does not mean making it available to any future load from the same address. I imagine this as cache lines being invalidated as soon as they have been consumed by the SM.

    When designing the Volta architecture with its (re-) unified L1/TEX + shared memory, Nvidia seemingly put a higher priority on the ability to do fast prototyping of CUDA kernels and get passable performance without manually caching in shared memory. One can argue that communication between SMs via global memory is a rather advanced technique and developers using it should know how to avoid reading stale data from L1.

    | Architecture               | Tesla | Fermi | Kepler | Maxwell | Pascal | Pascal | Volta and newer |
    |----------------------------|-------|-------|--------|---------|--------|--------|-----------------|
    | Compute Capability         | 1.x   | 2.x   | 3.x    | 5.x     | 6.0    | 6.y    | 7.0 - 12.0      |
    | unified L1 + shared memory |       | ✅    | ✅     |         |        |        | ✅              |
    | unified L1 + ROD = L1/TEX  |       |       |        | ✅      | ✅     | ✅     | ✅              |
    | local LD/ST -> L1          |       | ✅    | ✅     |         |        |        | ✅              |
    | constant LD -> ROC         | ✅    | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |
    | texture LD -> TEX          | ✅    | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |
    | global LDU -> ROC          |       | ✅    |        |         |        |        |                 |
    | global LDG -> ROD/TEX      |       |       | ✅     | ✅      | ✅     | ✅     | ✅              |
    | global LD -> L1            |       | ✅ ¹  | (✅) ² | (✅) ²  | ✅     | (✅) ² | ✅              |
    | global ST -> L1            |       |       |        |         |        |        | ✅              |
    | all LD/ST -> L2            |       | ✅    | ✅     | ✅      | ✅     | ✅     | ✅              |

    1: Can be disabled via -Xptxas -dlcm=cg (See CUDA C Programming Guide 4.0 (PDF, on Hunter College website): F.4.2).
    2: Can be enabled via -Xptxas -dlcm=ca for certain GPUs (Kepler GK110B, GK20A, GK210, Maxwell GM204 and Pascal GP104).
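
    As a concrete (hypothetical) illustration of how these flags are passed, assuming a source file kernel.cu with an ordinary global load:

    ```cuda
    // kernel.cu -- the -dlcm option is forwarded to ptxas and sets the *default*
    // cache operator used for global loads that don't specify one explicitly:
    //
    //   nvcc -arch=sm_20 -Xptxas -dlcm=cg kernel.cu   // Fermi: cache global loads in L2 only
    //   nvcc -arch=sm_35 -Xptxas -dlcm=ca kernel.cu   // e.g. Kepler GK110B: opt in to L1 load caching
    __global__ void axpy(float a, const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];  // these global loads use the default cache operator set by -dlcm
    }
    ```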

    Looking at the assembly generated for LDG (e.g. via the __ldg() intrinsic) on Volta and newer, it results in the .nc (non-coherent) modifier on the ld.global instruction in PTX, which is still reflected in SASS as well (as .CONSTANT, which has nothing to do with constant memory; constant memory has its own SASS load instructions). So for some reason LDG is still handled differently from an ordinary global load, even though both should go through the same unified L1/TEX cache, which, unlike the partitioned-off shared memory, does not feature a (documented) separation between its L1 and TEX capabilities. Maybe this is just about priority/cache-eviction policy?
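
    One (illustrative) way to observe this is to compile a small kernel that uses the __ldg() intrinsic to PTX and look for the .nc modifier; file and kernel names below are placeholders:

    ```cuda
    // ldg_demo.cu -- inspect the PTX with: nvcc -arch=sm_70 -ptx ldg_demo.cu
    __global__ void copy_ldg(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float a = in[i];          // ordinary global load: ld.global.f32
            float b = __ldg(&in[i]);  // read-only/non-coherent load: ld.global.nc.f32
            out[i] = a + b;
        }
    }
    ```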

    Cache Operators/Hints

    For a deeper dive, take a look at the PTX ISA's "Cache Operators for Memory Store Instructions" (also available as CUDA C++ intrinsics called "Store Functions Using Cache Hints"):

    | Operator | Meaning |
    |----------|---------|
    | .wb | Cache write-back all coherent levels. The default store instruction cache operation is st.wb, which writes back cache lines of coherent cache levels with normal eviction policy. [...] |
    | .cg | Cache at global level (cache in L2 and below, not L1). Use st.cg to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache. |
    | .cs | Cache streaming, likely to be accessed once. The st.cs store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data. |
    | .wt | Cache write-through (to system memory). The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache. |

    Parallel Thread Execution ISA Version 8.5 from CUDA Toolkit Documentation 12.6 Update 3
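
    These operators can be used from CUDA C++ without writing inline PTX via the store intrinsics mentioned above (__stwb(), __stcg(), __stcs(), __stwt()). A small illustrative kernel (names and scenario are mine, not from the PTX ISA) might look like this:

    ```cuda
    __global__ void store_with_hints(float* out, const float* in, int* flag, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = 2.0f * in[i];
            __stcs(&out[i], v);    // st.global.cs: streaming output, evict-first to limit cache pollution
            // __stwb(&out[i], v); // st.global.wb: default write-back behavior
            // __stcg(&out[i], v); // st.global.cg: bypass L1, cache only in L2
        }
        if (i == 0) {
            __stwt(flag, 1);  // st.global.wt: write through the L2 cache, intended for
                              // System Memory addresses (e.g. mapped/pinned host memory)
        }
    }
    ```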


    Edit: After finding the original version of this table in the PTX ISA for Fermi and reading the additional Fermi-specific information in that version, I now know that the distinction between st.wb/st.wt and st.cg was introduced primarily for st.local, not for st.global:

    | Operator | Meaning |
    |----------|---------|
    | .wb | [...] Data stored to local per-thread memory is cached in L1 and L2 with write-back. However, sm_20 does NOT cache global store data in L1 because multiple L1 caches are not coherent for global data. Global stores bypass L1, and discard any L1 lines that match, regardless of the cache operation. Future GPUs may have globally-coherent L1 caches, in which case st.wb could write-back global store data from L1. |
    | .cg | [...] In sm_20, st.cg is the same as st.wb for global data, but st.cg to local memory uses the L1 cache, and marks local L1 lines evict-first. |
    | .cs | [...] |
    | .wt | [...] The st.wt store write-through operation applied to a global System Memory address writes through the L2 cache, to allow a CPU program to poll a SysMem location written by the GPU with st.wt. Addresses not in System Memory use normal write-back. |

    Parallel Thread Execution ISA Version 2.3 (PDF, on Drexel University website) originally from CUDA Toolkit 4.0

    Due to the thread-private nature of local memory, the per-SM L1 cache is coherent for local accesses. While my previous interpretation (see below) might still be correct for global memory access on Volta and newer GPUs, my arguments were flawed, so take the following with a grain of salt.


    Previous Interpretation of Cache Operators/Hints Considering Only st.global but Not st.local

    This table is written confusingly (maybe to avoid describing architecture-specific behavior), but given the information we already have about L1 write-caching, the best interpretation I can come up with is that st.wb and st.wt only concern how the write is handled by L2, while L1 write-caching is left up to the particular architecture, as L1 is not a coherent level and probably does not contain the logic necessary to implement write-back. Since the description of st.wb does not concern the handling in non-coherent levels at all, this is fine.

    One can think of L1 write-caching as always being write-through (i.e. eagerly writing to the next level), but with invalidation of the L1 cache line on pre-Volta architectures, which is not what one typically thinks of as "write-through" but should still be fine.

    st.cg explicitly disallows caching in L1, i.e. it should always reproduce the behavior of pre-Volta architectures. And st.cs does not mention the cache levels at all and just determines the eviction policy. This interpretation agrees with the one given at Making better sense of the PTX store caching modes (assuming a Volta or later architecture).

    So stores are always cached in L2, while L1 write-caching depends on the GPU architecture and the actual store instruction used.