caching, memory, cuda, gpu-shared-memory

Estimating transactions on coalesced memory accesses


I've queried my CUDA device (T1000, SM_75) and picked out the values of some specific CUDA device attributes, as follows. (Note: this question is a little bit lengthy ☺.)

#include <cuda.h>
#include <stdio.h>

int main() {
    cuInit(0); CUdevice dev; cuDeviceGet(&dev, 0);

    int val; cuDeviceGetAttribute(&val, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
    printf("SM count:                    %d\n", val);
    cuDeviceGetAttribute(&val, CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, dev);
    printf("L2 cache size:               %dM bytes\n\n", val >> 20);

    cuDeviceGetAttribute(&val, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR, dev);
    printf("Max registers per SM:        %dK 32-bit\n", val >> 10);
    cuDeviceGetAttribute(&val, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR, dev);
    printf("Max shared memory per SM:    %dK bytes\n\n", val >> 10);


    size_t bytes; cuDeviceTotalMem(&bytes, dev);
    printf("Total memory:                %luG bytes\n", bytes / 1000000000);
    cuDeviceGetAttribute(&val, CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH, dev);
    printf("Global memory bus width:     %d-bit\n\n", val);
}
SM count:                    14
L2 cache size:               1M bytes

Max registers per SM:        64K 32-bit
Max shared memory per SM:    64K bytes

Total memory:                4G bytes
Global memory bus width:     128-bit

I've got several related questions below.

My GPU (T1000, SM_75) has a memory bus that is 128 bits wide.

When we try to estimate transactions on coalesced memory accesses, the CUDA device attributes contain nothing about L1 and L2 cache line sizes. Do the following quotes from a previous article on cache statistics still hold true today?

Loads from the caches are made via transactions of a fixed size. L1 transactions are 128 bytes, and L2 are 32 bytes.

Global memory accesses are routed either through L1 and L2, or only L2, depending on the architecture and the type of instructions used. Local memory is routed through L1 and L2 cache. Shared memory accesses do not go through any cache.

Here are the next quotes from the CUDA programming guide on memory accesses, here and there:

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions.

Global memory accesses are always cached in L2.

As a caching example: when each of the 32 threads within a warp requires a 4-byte memory access to a perfectly aligned address range, the warp will coalesce these into one 128-byte (32 * 4-byte) memory access. Will the warp then issue one 128-byte global memory transaction and four 32-byte L2 cache transactions sequentially? And does a 128-byte (128 * 8-bit) global memory transaction require 8 consecutive physical accesses over the 128-bit memory bus?
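
For concreteness, here is a minimal sketch of the access pattern I have in mind (the kernel name and sizes are just illustrative):

__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    // consecutive threads read consecutive 4-byte words, so each warp
    // touches one contiguous 128-byte range (warp 0: bytes [0,128), warp 1: [128,256), ...)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

(Pointers returned by cudaMalloc are aligned to at least 256 bytes, so each warp's 128-byte range sits on a 128-byte boundary.)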

Below are the last quotes from the CUDA programming guide on shared memory access, here and there:

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.

Similarly, for the shared memory example: when each of the 32 threads within a warp requires a 4-byte memory access to a perfectly aligned address range, the warp will coalesce these into one 128-byte (32 * 4-byte) memory access. Will the warp then issue one 128-byte global memory transaction, four 32-byte L2 cache transactions, and one 128-byte (32 * 32-bit) shared memory transaction sequentially?
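
And here is a corresponding sketch for the shared memory case (again purely illustrative; the round trip through shared memory is my own example). Each thread touches a distinct 32-bit word, so each of the 32 banks services exactly one request:

// launched with a single warp of 32 threads
__global__ void shared_one_word_per_bank(const float* in, float* out)
{
    __shared__ float tile[32];
    int t = threadIdx.x;          // 0..31 within the warp
    tile[t] = in[t];              // thread t -> word t -> bank t: no conflict
    __syncwarp();
    out[t] = tile[(t + 1) % 32];  // still one distinct word per bank, still conflict-free
}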


Solution

  • Is the 1M L2 cache allocated across the 14 SMs?

    Yes, the L2 cache is a device-wide resource. You can get an idea of this by looking at any of the architecture whitepapers, such as this one for Turing (cc7.5) GPUs.

    Are the 64K registers allocated across the 14 SMs, or does it imply 64K * 14 total registers?

    Each SM has 64K registers. This can be found in the whitepaper or by studying the architecture sections of the programming guide.

    Is the 64K of shared memory allocated across the 14 SMs, or does it imply 64K * 14 of total shared memory?

    Shared memory is also a per-SM resource. The same resources mentioned previously will clarify this for you (architecture whitepaper, arch sections of the programming guide). So the total is 64K times the number of SMs in that GPU, for that arch type (cc7.5).
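
    For example, a quick sanity check along the lines of the code in the question (a sketch only; drop it into that main() after dev has been obtained):

        int sm_count, regs_per_sm, smem_per_sm;
        cuDeviceGetAttribute(&sm_count,    CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
        cuDeviceGetAttribute(&regs_per_sm, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR, dev);
        cuDeviceGetAttribute(&smem_per_sm, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR, dev);
        printf("Total registers:     %dK 32-bit\n", (regs_per_sm >> 10) * sm_count);  /* 64K * 14 = 896K */
        printf("Total shared memory: %dK bytes\n",  (smem_per_sm >> 10) * sm_count);  /* 64K * 14 = 896K */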

    Does the 128-bit global memory bus width apply to local memory too?

    Local memory is a logical resource that can manifest (i.e. be physically backed) in registers, cache, or device memory. When it manifests in device memory, the same memory bus is used to retrieve it as would be used to retrieve items in the logical global space.
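
    For illustration (a toy example of mine, not from the guide): a per-thread array that the compiler cannot keep in registers, for instance because it is indexed dynamically, typically ends up in the logical local space, physically backed by L1/L2 and, ultimately, device memory:

        __global__ void local_example(const int* idx, float* out)
        {
            float scratch[64];                    // per-thread "local" array
            for (int i = 0; i < 64; ++i)
                scratch[i] = (float)i;
            // the dynamic index usually forces scratch into local memory, so this
            // load travels through L1/L2 over the same bus as global accesses
            out[threadIdx.x] = scratch[idx[threadIdx.x] & 63];
        }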

    Do the following quotes from a previous article on cache statistics still hold true today?

    Since Pascal (cc6.0), the L1 and L2 granularities are a minimum of 32 bytes each. The remaining statements there are basically correct.

    Is it correct that this basic calculation fulfills the requirement?

    Yes, that is basically correct. One possible efficient path for usage of memory is to have each thread in the warp load a 4-byte adjacent quantity. Having that block of data be aligned to a 128-byte boundary may also be beneficial, but is often of less concern. Unit 4 of this online training series will delve into all the nuances here in more detail.
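
    As a small illustration of the alignment point (the kernel is mine, not taken from the training series): shifting the same per-thread 4-byte loads off a 128-byte boundary makes each warp's 128-byte request straddle a boundary, so it touches 5 x 32-byte sectors instead of 4.

        __global__ void copy_offset(const float* __restrict__ in,
                                    float* __restrict__ out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n - 1)
                out[i] = in[i + 1];   // base shifted 4 bytes off a 128-byte boundary
        }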

    Does a 128-byte (128 * 8-bit) global memory transaction require 8 consecutive physical accesses to the memory bus of 128-bit width?

    That's not the way to think about it. You're using "global memory" here (which is a logical space definition) where I think you really mean "device memory". The minimum access granularity from device memory on most GPUs is 32-bytes. These memory "segments" are "aligned" to the divisions implied by the L2 cache "sectors". A single 32-byte segment in memory could be retrieved in a single operation, occupying whatever bus width may exist, at least up to 256 bits wide. Multiple segments to be retrieved could occupy even wider memory bus widths, on GPUs that may have those.
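
    To put numbers on it, here is a host-side sketch of that sector arithmetic, assuming 32-byte sectors:

        #include <stdio.h>

        /* number of 32-byte sectors touched by a request of `bytes` bytes starting at `base` */
        static int sectors_touched(unsigned long long base, unsigned long long bytes) {
            return (int)((base + bytes - 1) / 32 - base / 32 + 1);
        }

        int main(void) {
            printf("%d\n", sectors_touched(0, 128));  /* aligned 128-byte warp request -> 4 sectors */
            printf("%d\n", sectors_touched(4, 128));  /* shifted by 4 bytes            -> 5 sectors */
            return 0;
        }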

    Is this calculation correct, too?

    Yes, generally speaking, shared memory efficiency will be highest using a similar pattern as what you would use for high efficiency from device memory. However, shared memory allows for more flexible access patterns (one 32-bit item per "bank") to achieve high efficiency, which goes beyond what is possible for highest-efficiency access to device (off-chip DRAM) memory. We don't use the word "coalesce" to describe shared memory activity, nor do we refer to caches w.r.t. shared memory. Efficient use of shared memory is a fairly involved topic, but it is covered in numerous posts here on the cuda tag, and furthermore it is covered in the same training unit I previously linked.
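
    As one small sketch of that flexibility (my own example; the padding trick is a common idiom, not something specific to the linked training unit): reading a column of a 32x32 shared memory float tile would hit a single bank 32 times per warp, but padding each row by one word spreads the column across all 32 banks, i.e. one 32-bit item per bank.

        // launched with a 32x32 thread block
        __global__ void transpose_tile(const float* in, float* out)
        {
            __shared__ float tile[32][32 + 1];    // "+1" padding keeps column reads conflict-free
            int x = threadIdx.x, y = threadIdx.y;
            tile[y][x] = in[y * 32 + x];          // row write: one word per bank
            __syncthreads();
            out[y * 32 + x] = tile[x][y];         // column read: conflict-free thanks to the padding
        }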