I'm trying to understand whether a GPU's Last-Level Cache (LLC) is invalidated or preserved across multiple kernel launches, since a preserved cache could increase the effective memory bandwidth. I'm aware that this likely depends on the specific GPU architecture. If the cache is indeed preserved, at least on some architectures, perhaps kernels can be carefully written to exploit it as a communication buffer when kernel fusion is not feasible (I sketch the pattern I have in mind at the end of this question).
However, the answers I can currently find on the Web are unclear and contradictory, and many are outdated. I found a couple of posts on the Nvidia and AMD developer forums without a clear answer; the best was a suggestion to measure it with micro-benchmarks. There are also several related questions on Stack Exchange:
In the 2011 question NVidia CUDA: cache L2 and multiple kernel invocations, Zk1001 answered:
Assuming you are talking about L2 data cache in Fermi. I think the caches are flushed after each kernel invocation. In my experience, running two consecutive launches of the same kernel with a lots of memory accesses (and #L2 cache misses) doesn't make any substantial changes to the L1/L2 cache statistics.
This answer only says that when the working set is large, the L2 cache is too small to provide any temporal locality: for example, two back-to-back launches that each stream a multi-GiB array will miss in a 40 MB L2 whether or not its contents survive the launch boundary. That is consistent with my observations (the same is true on a CPU), but it says nothing about whether the cache contents are preserved across launches.
In the 2016 question How does cache affect while a same kernel is being launched repeatedly, Melissa P answered:
For AMD Radeon GCNs, L1 and L2 cache is persistent between all kernels and all different kernels. A kernel can use cached data from any other kernel. Additionally, Local Memory inside a Compute Unit is not cleared/zeroed between kernel runs (more precisely, between work-group runs). This means you have to initialize local variables. The same should apply for nVidia/CUDA devices and generic SIMD CPUs.
That being said, OpenCL does not know or define different level of caches, caches are vendor specific. Any functionality that handles or manages caching is a vendor specific extension.
However, no citation is given.
In the NVIDIA A100 Tensor Core GPU Architecture whitepaper, Nvidia states:
Alongside the raw data bandwidth improvements, A100 improves data fetch efficiency and reduces DRAM bandwidth demand with a 40 MB L2 cache that is almost 7x larger than that of Tesla V100. To fully exploit the L2 capacity A100 includes improved cache management controls. Optimized for neural network training and inferencing as well as general compute workloads, the new controls ensure that data in the cache is used more efficiently by minimizing writebacks to memory and keeping reused data in L2 to reduce redundant DRAM traffic.
For example, for DL inferencing workloads, ping-pong buffers can be persistently cached in the L2 for faster data access, while also avoiding writebacks to DRAM. For producer-consumer chains, such as those found in DL training, L2 cache controls can optimize caching across the write-to-read data dependencies. In LSTM networks, recurrent weights that are shared across multiple GEMM operations can be preferentially cached and reused in L2.
It appears that a persistent Last-Level Cache is supported at least on the Nvidia A100, but it's unclear whether other GPU architectures offer the same feature.
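If I'm reading the CUDA documentation correctly, this capability is exposed through the L2 "access policy window" API on devices of compute capability 8.0 and newer. Below is a minimal sketch of how it seems to be used; the helper name set_persistent_l2, the buffer buf, and the sizes are my own placeholders, not anything taken from the whitepaper.

#include <algorithm>
#include <cuda_runtime.h>

// Sketch (my own helper, not an official API): ask the driver to keep accesses
// to `buf` resident ("persisting") in L2 for kernels launched on `stream`.
// Requires a device of compute capability 8.0 or newer.
void set_persistent_l2(cudaStream_t stream, void *buf, size_t buf_bytes) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Set aside a portion of L2 for persisting accesses, capped at the device maximum.
  size_t carveout = std::min(buf_bytes, (size_t)prop.persistingL2CacheMaxSize);
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

  // Describe the address window whose accesses should be treated as persisting.
  cudaStreamAttrValue attr = {};
  attr.accessPolicyWindow.base_ptr  = buf;
  attr.accessPolicyWindow.num_bytes = std::min(buf_bytes, (size_t)prop.accessPolicyMaxWindowSize);
  attr.accessPolicyWindow.hitRatio  = 1.0f;  // fraction of accesses in the window given hitProp
  attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}

As I understand it, kernels subsequently launched on stream would then preferentially keep data from buf resident in L2 until the policy is reset.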
As GPUs have started to include larger Last-Level Caches, such as the 40 MB L2 cache in the Nvidia A100, the 128 MiB Infinity Cache in AMD RDNA2, and the 96 MiB Infinity Cache in AMD RDNA3, using the Last-Level Cache as a communication buffer across kernel launches is becoming at least theoretically feasible. So: is the Last-Level Cache invalidated across kernel launches, or is its content preserved, on current GPU architectures?
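For concreteness, here is the kind of pattern I have in mind, as a purely hypothetical sketch (the kernel names, sizes, and launch configuration are made up): a producer kernel writes an intermediate buffer that is much smaller than the LLC, and a separately launched consumer kernel reads it back, ideally hitting in the cache rather than going out to DRAM.

#include <cuda_runtime.h>

// Hypothetical sketch: if the LLC is preserved across launches and the
// intermediate buffer (~4 MiB here) stays resident, the consumer's loads
// could be served from cache rather than DRAM.
__global__ void producer(const float *in, float *intermediate, int n) {
  for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += gridDim.x * blockDim.x)
    intermediate[i] = 2.0f * in[i];      // results land in the LLC (write-back)
}

__global__ void consumer(const float *intermediate, float *out, int n) {
  for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += gridDim.x * blockDim.x)
    out[i] = intermediate[i] + 1.0f;     // ideally an LLC hit, not a DRAM read
}

int main() {
  const int n = 1 << 20;                 // 1M floats = 4 MiB, well under a 40+ MiB LLC
  float *in, *mid, *out;
  cudaMalloc(&in,  n * sizeof(float));
  cudaMalloc(&mid, n * sizeof(float));
  cudaMalloc(&out, n * sizeof(float));
  producer<<<256, 256>>>(in, mid, n);    // kernel launch #1 fills the intermediate buffer
  consumer<<<256, 256>>>(mid, out, n);   // kernel launch #2 consumes it
  cudaDeviceSynchronize();
  return 0;
}

Whether the consumer's loads can actually be served from the LLC on real hardware is exactly what I'm asking.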
The L2 cache is a proxy for any access to device memory, on any current CUDA GPU. Device memory is the backing for logical global space accesses that actually resolve to device memory (as opposed to, say, pinned host memory), as well as for logical local space accesses. As far as I know it has always been this way, for every compute capability since the first one that incorporated an L2 cache (cc 2.x).
The L2 cache on a CUDA GPU is not invalidated between kernel launches. This is fairly easy to demonstrate. I happened to write some "casual" code to measure L2 bandwidth here recently. That, plus some knowledge of the Nsight Compute profiler, should be all that is necessary to confirm this on any CUDA GPU of interest.
Here is a test case, which should be usable pretty much as-is on any CUDA GPU:
# cat t37.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// find largest power of 2 less than or equal to x
unsigned flp2(unsigned x) {
  x = x | (x>>1);
  x = x | (x>>2);
  x = x | (x>>4);
  x = x | (x>>8);
  x = x | (x>>16);
  return x - (x>>1);
}

// host-side timing helper, microsecond resolution
unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using mt = unsigned long long;

// grid-stride copy kernel; __ldcg loads bypass L1, so the traffic is serviced by L2
__global__ void k(mt *d, mt *d2, int len, int lps){
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < len; i += gridDim.x*blockDim.x)
      d[i] = __ldcg(d2+i);
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  const int nTPSM = prop.maxThreadsPerMultiProcessor;
  const int nSM = prop.multiProcessorCount;
  const unsigned l2size = prop.l2CacheSize;
  unsigned sz = flp2(l2size)/2;
  sz = sz/sizeof(mt); // approx 1/2 the size of the L2
  const int nTPB = 512; // block size
  const int nBLK = (nSM*nTPSM)/nTPB;
  const int loops = 100;
  mt *d, *d2;
  cudaMalloc(&d, sz*sizeof(mt));
  cudaMalloc(&d2, sz*sizeof(mt));
  k<<<nBLK, nTPB>>>(d, d2, sz, 1); // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  k<<<nBLK, nTPB>>>(d, d2, sz, loops);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "bw: " << (sz*2*sizeof(mt)*loops)/(float)dt << "MB/s" << std::endl;
}
# nvcc -o t37 t37.cu
# ./t37
bw: 2.20318e+06MB/s
# ncu --cache-control none --metrics dram__bytes_read.sum t37
==WARNING== Note: Running with uncontrolled GPU caches. Profiling results may be inconsistent.
==PROF== Connected to process 137724 (/root/bobc/t37)
==PROF== Profiling "k" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "k" - 1: 0%....50%....100% - 1 pass
bw: 210214MB/s
==PROF== Disconnected from process 137724
[137724] t37@127.0.0.1
k(unsigned long long *, unsigned long long *, int, int) (174, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
dram__bytes_read.sum Mbyte 18.39
-------------------- ----------- ------------
k(unsigned long long *, unsigned long long *, int, int) (174, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
dram__bytes_read.sum byte 0
-------------------- ----------- ------------
#
(L4 GPU, CUDA 12.2)
We see that on the first invocation of the copy kernel, the profiler reports ~18 MB loaded from DRAM. On the second invocation, the profiler reports 0 bytes loaded from DRAM. The only way this is possible is if the L2 is not invalidated between the two kernel launches.
You may have noted my use of the --cache-control none switch for Nsight Compute. That switch tells Nsight Compute not to modify cache state between kernel launches, and is documented here. Taking a look at the option description there:
Control the behavior of the GPU caches during profiling. Allowed values:

all: All GPU caches are flushed before each kernel replay iteration during profiling. While metric values in the execution environment of the application might be slightly different without invalidating the caches, this mode offers the most reproducible metric results across the replay passes and also across multiple runs of the target application.

none: No GPU caches are flushed during profiling. This can improve performance and better replicates the application behavior if only a single kernel replay pass is necessary for metric collection. However, some metric results will vary depending on prior GPU work, and between replay iterations. This can lead to inconsistent and out-of-bounds metric values.
The reason it can "better replicate application behavior" is that when you are not running the profiler, the caches are not invalidated between kernel launches.
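For comparison, profiling with --cache-control all (which I believe is the default) flushes the caches before each profiled kernel, per the documentation quoted above, so I would expect both kernel invocations in that case to report non-zero DRAM reads:

# ncu --cache-control all --metrics dram__bytes_read.sum t37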
The idea of "in between" kernel launches is poorly defined in a concurrent scenario where multiple kernels overlap. To suggest that the L2 cache is invalidated either at the end of one kernel launch or at the beginning of another would also imply that it could happen in the midst of some other, still-running kernel's activity.