cuda, gpgpu, memory-bandwidth

Is there a way to measure the memory bandwidth of a specific instruction or line of code in CUDA?


Is there a way to measure the memory bandwidth for a specific memory instruction or line of code in CUDA? (nvprof can report the memory bandwidth for an entire kernel.) If the clock() function is the only way to do so, what is the equation for calculating the bandwidth? (The number of bytes moved by the coalesced accesses of that instruction or line, divided by the clock() difference converted to seconds?)
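
Concretely, here is the kind of measurement I have in mind (a hypothetical sketch; timed_load is a made-up name). Note that clock64() has to come after an instruction that depends on the loaded value, otherwise it can be issued before the load actually completes:

```cuda
#include <cuda_runtime.h>

// Hypothetical microbenchmark: time one coalesced load with clock64().
// Assumes the launch covers the array exactly (no bounds check).
__global__ void timed_load(const float* __restrict__ in,
                           float* __restrict__ out,
                           long long* cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long t0 = clock64();
    float v = in[i];          // the load under test
    float use = v + 1.0f;     // data dependency: this add cannot issue until
                              // the load completes, so t1 includes its latency
    long long t1 = clock64(); // issued in order, after the dependent add

    out[i] = use;             // keep the result live so nothing is optimized away
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;
}
```

The bandwidth estimate would then be bytes moved divided by elapsed time: for one fully coalesced warp-wide float load, 32 threads × 4 bytes = 128 bytes over (t1 − t0) / SM clock rate seconds (cudaDeviceProp.clockRate gives the rate in kHz). Two caveats: clock64() counts cycles on whichever SM the block happens to run on, and timing a single access measures latency rather than sustained bandwidth, which needs many loads in flight.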

I want to see whether a certain instruction or line of code over- or under-utilizes the memory bandwidth (e.g., because of MSHR limits).

I have two devices, a GTX 980 (Maxwell, sm_52) and a P100 (Pascal, sm_60), on an x86_64 Linux system.


Solution

  • One tool that can give some insight is instruction-level profiling (PC sampling) in the NVIDIA Visual Profiler / Nsight. It can show which source line to blame when an SM "stalls" (fails to issue any instruction). Because LD/ST instructions do not block execution until the loaded value is first used, the stall is typically attributed to an instruction following the data fetch rather than the fetch itself; see the sketch after the link below.

    Here's an NVIDIA devblog on the topic. https://devblogs.nvidia.com/parallelforall/cuda-7-5-pinpoint-performance-problems-instruction-level-profiling/
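
    As a concrete illustration of that attribution (my own hypothetical sketch, not code from the blog post), the PC sampler would typically report the stall on the line that first consumes the loaded value, with a memory-dependency stall reason:

    ```cuda
    __global__ void stall_example(const float* __restrict__ in,
                                  float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float v = in[i];    // LD issues and execution continues without blocking
        float w = i * 2.0f; // independent work overlaps with the load in flight
        out[i] = v + w;     // first use of v: the warp waits here for the load,
                            // so PC samples pile up on this line
    }
    ```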