I assume that, in the context of a data fetch for a CUDA core, registers are the fastest, then shared memory, then L1 cache, then L2 cache, and global memory is the slowest.
I assume that in a GPU, data moves through the following hierarchy:
GLOBAL MEMORY --> L2 cache --> L1 cache --> Shared memory --> Registers --> CUDA CORE
Question 1: If a CUDA core does not find the requested data in the registers, where does it look next - in shared memory or in the L1 cache? ChatGPT says that for an A100 or V100 GPU it will look in the L1 cache first, and only if the data is not there will it look in shared memory. That does not sound right to me: since shared memory is faster than the L1 cache, my logic says that once a CUDA core cannot find the requested data in registers, it should look in the next fastest memory (shared memory) before looking in the L1 cache. Could you please tell me whether I am correct, and if not, what I am missing?
Question 2: Suppose that in my CUDA kernel I do not declare any shared memory arrays and instead use only device arrays. I visualize the data movement hierarchy in the GPU as follows:
GLOBAL MEMORY --> L2 cache --> L1 cache --> Shared memory --> Registers --> CUDA CORE
Question - Scenario a) - When the CUDA core requests data, will the data be stored in shared memory even though we have declared no shared memory arrays and are only using device arrays? I ask this because I visualize the data movement hierarchy with shared memory sitting between the L1 cache and the registers. If my understanding is not right, could you please let me know what I am missing?
Question - Scenario b) - In the scenario where we declare no shared memory arrays and only declare device arrays, suppose the data does not get stored in shared memory and is only stored in the L1 cache. If that is true, is the shared memory switched off when no shared arrays are declared, so that data moves directly from the L1 cache to registers?
CUDA GPUs mostly adhere to a load-store architecture.
CUDA GPUs have a memory map that is partitioned into separate spaces. The relevant spaces here are global, local (registers), and shared.
CUDA GPUs all currently have a similar memory hierarchy.
The L1 and L2 caches can cache data in the local, (system,) and global spaces only. They do not cache data in the shared space.
The CUDA compiler creates instructions that explicitly or implicitly target spaces. Because it is a load-store architecture, with a few exceptions the possible movements are:
global <-> local
shared <-> local
A load-store architecture means that when any functional unit on the GPU is issued an instruction, the operands of that instruction are registers (there are a few exceptions; they are not relevant to this discussion).
Therefore when a CUDA core (a functional unit in the GPU SM) needs data, it will always get that data from registers, and it never needs to look anywhere else.
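To make the load-store point concrete, here is a minimal sketch (the kernel and variable names are invented for illustration): even a one-line increment of a global array compiles to a load into a register, an add on register operands, and a store back to global memory; the CUDA core only ever sees the registers.

```
__global__ void increment(float *g, int n)   // g resides in the global space
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = g[i];   // global -> register (an LDG-class load)
        r = r + 1.0f;     // the functional unit operates on register operands only
        g[i] = r;         // register -> global (an STG-class store)
    }
}
```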
The other relevant transactions therefore are moving data between the global space and a register (in the local space) or moving data between the shared space and a register.
The L1 and L2 caches pertain to movement between global space data and a register. They do not pertain to movement between the shared space and a register.
So the paths are:
global <-> L2 <-> L1 <-> register (accomplished by LD, LDG, ST, STG instructions)
and
shared <-> register (accomplished by LD, LDS, ST, STS instructions)
Ordinarily, to get data into shared memory, you would need to load from global to a register, then store from register to shared. This will be visible at the machine code (SASS) level as well as visible at the source code level.
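As a minimal sketch of that staging pattern (the kernel and variable names here are invented for the example), a plain assignment from a global array into a shared array is compiled as a global load into a register followed by a shared store from that register, and typically appears in the SASS as an LDG followed by an STS:

```
__global__ void stage_to_shared(const float *gdata)   // gdata is in the global space
{
    __shared__ float smem[256];                        // allocation in the shared space
    int t = threadIdx.x;
    // One source line, two movements:
    //   global -> register (LDG-class), then register -> shared (STS-class)
    smem[t] = gdata[blockIdx.x * blockDim.x + t];
    __syncthreads();                                   // make the staged data visible block-wide
    // ... the block can now load from smem (LDS-class) into registers and compute ...
}
```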
Question 1: If a CUDA core does not find the requested data in the registers, where does it look next - in shared memory or in the L1 cache?
A CUDA core will only ever operate on data in registers. It will never need to "look for" data. An LD instruction, if targeting the global space, may look in L1, then L2, then retrieve data from main memory. The retrieved data will be placed in a register. An ST instruction, if targeting the global space, places register data (ultimately) in main memory. An LD instruction, if targeting the shared space, will retrieve data from shared memory only, and place that data in a register.
Question 2: Suppose that in my CUDA kernel I do not declare any shared memory arrays and instead use only device arrays. I visualize the data movement hierarchy in the GPU as follows:
GLOBAL MEMORY --> L2 cache --> L1 cache --> Shared memory --> Registers --> CUDA CORE
This is not correct, as should now be evident. The possible path is:
GLOBAL MEMORY --> L2 cache --> L1 cache --> Registers --> CUDA CORE
will the data be stored in shared memory even though we have declared no shared memory arrays and are only using device arrays?
No. If you have no __shared__ declarations in your code, the compiler will not generate any instructions that target shared memory.
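If you want to convince yourself of this, here is a sketch (the kernel below is just a representative device-array-only example, not your code): compile it and dump the machine code with cuobjdump -sass; you should find LDG/STG-class global accesses but no LDS/STS (shared) instructions.

```
// Uses only device (global) arrays; there is no __shared__ declaration,
// so the compiler has no reason to emit shared-space instructions.
__global__ void scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];   // global -> register, multiply in registers, register -> global
}
```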
is the shared memory switched off when no shared arrays are declared, so that data moves directly from the L1 cache to registers?
Shared memory is never involved in the transfer of data from L1 to registers.