What is the correct option for measuring bandwidth using nvprof --metrics from the command line? I am using flop_dp_efficiency to get the percentage of peak FLOPS, but there seem to be many options for bandwidth measurement in the manual, and I don't really understand what each one measures. For example, dram_read, dram_write, gld_read, and gld_write all look the same to me. Also, should I report bandwidth as the sum of read and write throughput, on the assumption that both happen simultaneously?
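For reference, this is roughly what I run right now (./my_app is just a placeholder for my executable):

```
# what I currently use for the FLOPS side
nvprof --metrics flop_dp_efficiency ./my_app

# candidates I see for the bandwidth side -- not sure which is the right one
nvprof --metrics dram_read_throughput,dram_write_throughput,gld_throughput,gst_throughput ./my_app
```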
Edit:
Based on the excellent answer with the diagram, what would be the bandwidth going from the device memory to the kernel? I am thinking of taking the minimum of the bandwidth (read + write) along the path from the kernel to the device memory, which is probably DRAM to L2 cache.
I am trying to determine whether a kernel is compute- or memory-bound by measuring FLOPS and bandwidth.
In order to understand the profiler metrics in this area, it's necessary to have an understanding of the memory model in a GPU. I find this diagram (which used to be published in the Nsight Visual Studio Edition documentation) useful; a similar chart is now part of the Nsight Compute documentation. I have marked up the diagram with numbered arrows, which refer to the numbered metrics (and direction of transfer) listed below:
Please refer to the CUDA profiler metrics reference for a description of each metric:
Notes:
With the above description, it's possible your question still isn't answered; in that case you would need to clarify what exactly you want to measure. Based on your question as written, though, you probably want to look at the dram_xxx metrics, since what you care about is the memory bandwidth actually consumed.
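As a concrete example (the metric names are taken from the nvprof metrics reference; ./my_app is a placeholder for your executable):

```
# throughput actually consumed at the device memory (DRAM) level
nvprof --metrics dram_read_throughput,dram_write_throughput ./my_app

# by contrast, the gld_/gst_ metrics report global load/store traffic as seen
# at the kernel level, which can differ from DRAM traffic because of caching
nvprof --metrics gld_throughput,gst_throughput ./my_app
```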
Also, if you are simply trying to get an estimate of the maximum available memory bandwidth, running the CUDA sample code bandwidthTest is probably the easiest way to get a proxy measurement. Just use the reported device-to-device bandwidth number as an estimate of the maximum memory bandwidth available to your code.
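For example, assuming the samples are installed in the usual location (the path varies by toolkit version; newer toolkits ship the samples separately on GitHub):

```
# build and run the bandwidthTest sample, then read off the
# "Device to Device Bandwidth" figure as a proxy for peak memory bandwidth
cd /usr/local/cuda/samples/1_Utilities/bandwidthTest
make
./bandwidthTest
```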
Combining the above ideas, the dram_utilization metric gives a scaled result that represents the portion (from 0 to 10) of the total available memory bandwidth that was actually used.
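For example (again with ./my_app as a placeholder):

```
# reported on a 0-10 scale, e.g. "High (7)" indicates roughly 70% of the
# available DRAM bandwidth was used during the kernel's execution
nvprof --metrics dram_utilization ./my_app
```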