cudanvprof

dram_write_bytes result on P100


I used nvprof to profile a simple vecadd example (n=1024) on P100 but observed the dram_write_bytes is only 256 (rather than 1024*4 that I expected). Can someone explain why this number is small? What other metrics I need to add in to count for global memory writes? Thanks. float_count_sp number is correct (1024).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

__global__ void vecAdd(float* a, float* b, float* c, int n){
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    if(id < n) c[id] = a[id] + b[id];
}

int main(int argc, char* argv[]){
    int n = 1024;
    float *h_a, *d_a;
    float *h_b, *d_b;
    float *h_c, *d_c;
    size_t bytes = n*sizeof(float);
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    
    int i;
    for(i = 0; i < n; i++){
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i+1);
    }
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd <<<1, 1024>>> (d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    
    float sum = 0;
    for(i = 0; i < n; i++)
        sum += h_c[i] - h_a[i] - h_b[i];
    printf("final diff: %f\n", sum/n);
    
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    
    return 0;
}

Is it related to the sampling of nvprof? One time I get 384 bytes. Sometimes I even got 0 bytes. Weird thing is: if I change n to 1024*1024, I got bytes more than I expected (4688032). 4688032/1024/1024/4 = 1.11.


Solution

  • There are several reasons why your expectations are not being observed and the data is changing:

    1. The GPU memory system is shared by all engines. The primary engine the is the graphics/compute engine but other engines such as copy engines, display, etc. access the device memory and the memory control (FB = framebuffer) counters do not have a method to track the requester.

    2. NVPROF injection does not attempt to evict all context memory from the L2 cache. The cudaMemcpys prior to the launch and the kernel replay code in nvprof will leave the L2 cache in an inconsistent state.

    3. The initial size of 4KB is simply to small to accurately track. The full data set could be in L2 from either the cudaMemcpy or replay. Furthermore, the bytes you see can be from other clients such as the constant caches.

    It is highly recommends you scale the buffer size to a reasonable size. On newer GPUs the Nsight Compute profiler has improved L2 level breakdown of various clients to help detect unexpected traffic. In addition Nsight Compute replay logic clears the L2 cache so that each replay has a consistent start state.

    If you have a monitor attached it is recommended to move the monitor to a different GPU when looking at DRAM counters. nvprof L2 counters generally filter the count by traffic from the SMs so traffic from copy engines, the display controller, MMU, constant caches, etc. will not show up in the L2 counters.