I want to test the register bandwidth of an NVIDIA GPU (OpenCL/CUDA). How can I do that?
I can't find any information about register bandwidth tests on the Internet, only bandwidth tests for the various cache levels.
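Register access adds effectively 0 clock cycles of latency, and register throughput is tied to the GPU core clock frequency, so the bandwidth can be derived directly from the arithmetic throughput.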
Say a GPU has 10 TFLOP/s of compute throughput with FP32 fused multiply-add (FMA) instructions. Each FMA instruction performs 2 FLOPs, reads 3 FP32 inputs from registers, and writes 1 FP32 output to a register. Each FP32 number is 4 bytes. That makes 5 trillion FMA instructions per second, accessing 20 trillion FP32 numbers per second, for a combined register bandwidth of 80 TB/s.
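So GPU register bandwidth in TB/s is (FP32 TFLOP/s) × (8 bytes/FLOP), where the 8 bytes/FLOP comes from 4 operands × 4 bytes per FMA divided by 2 FLOPs per FMA. This holds for any GPU, since it depends only on the FMA operand count and the FP32 size.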
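The arithmetic above can be written out as a small Python sketch. The function name `register_bandwidth_tb_per_s` and the constant names are made up for illustration; the only input is the FP32 FMA throughput in TFLOP/s, which is assumed to be a measured or datasheet value.

```python
FLOPS_PER_FMA = 2      # one multiply + one add
OPERANDS_PER_FMA = 4   # 3 register reads + 1 register write
BYTES_PER_FP32 = 4


def register_bandwidth_tb_per_s(fp32_tflops: float) -> float:
    """Estimate combined register bandwidth in TB/s from FP32 FMA throughput in TFLOP/s."""
    fma_per_second = fp32_tflops / FLOPS_PER_FMA              # trillions of FMAs per second
    fp32_accesses_per_second = fma_per_second * OPERANDS_PER_FMA
    return fp32_accesses_per_second * BYTES_PER_FP32          # trillions of bytes per second


# Example from the text: a 10 TFLOP/s GPU
print(register_bandwidth_tb_per_s(10.0))  # 80.0
```

Plugging in a throughput that you measured yourself gives an estimate for your specific card rather than the datasheet peak.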
You can measure the FP32 TFLOP/s, for example, with this OpenCL-Benchmark tool.