What is the definition of the start and end of a kernel launch on the CPU and on the GPU (the yellow blocks)? Where is the boundary between them?
Please notice that the start, end, and duration of those yellow blocks on the CPU and on the GPU are different. Why does the CPU invocation of vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); take that long?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    //printf("id = %d \n", id);
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 1000000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = 1024;
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
CPU yellow block: [nvvp timeline screenshot]
GPU yellow block: [nvvp timeline screenshot]
Note that you mention nvprof, but the pictures you are showing are from nvvp, the visual profiler; nvprof is the command-line profiler.
GPU kernel launches are asynchronous. That means the CPU thread launches the kernel but does not wait for it to complete. The CPU activity is really just placing the kernel in a launch queue; the actual execution of the kernel may be delayed if anything else is happening on the GPU.
So there is no defined relationship between the CPU (API) activity, and the GPU activity with respect to time, except that the CPU kernel launch must obviously precede (at least slightly) the GPU kernel execution.
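You can see this asynchrony without a profiler. The following is a minimal sketch (the spin kernel and the cycle count are made up for illustration and are not part of your code): the host returns from the <<<...>>> launch almost immediately, and only cudaDeviceSynchronize() actually waits for the GPU.

#include <stdio.h>
#include <cuda_runtime.h>

// Busy-wait on the GPU for roughly the requested number of clock cycles
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    spin<<<1, 1>>>(100000000LL);   // launch returns right away; kernel is only queued
    printf("host: launch call has returned, kernel may still be running\n");
    cudaDeviceSynchronize();       // blocks until the kernel has finished
    printf("host: kernel has finished\n");
    return 0;
}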
The CPU (API) yellow block represents the duration of time that the CPU thread spends in a library call into the CUDA Runtime library, to launch the kernel (i.e. place it in the launch queue). This library call activity usually has some time overhead associated with it, in the range of 5-50 microseconds. The start of this period is marked by the start of the call into the library. The end of this period is marked by the time at which the library returns control to your code (i.e. your next line of code after the kernel launch).
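If you want to observe that host-side cost outside the profiler, a sketch like the one below times just the launch call with std::chrono (the no-op kernel and the warm-up launch are assumptions for illustration, not part of your code; the numbers are only indicative of the 5-50 microsecond range mentioned above):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() { }

int main()
{
    noop<<<1, 1>>>();              // warm-up: pay one-time context/initialization cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    noop<<<1, 1>>>();              // enqueue only; does not wait for completion
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();

    long long us = (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    printf("launch call took ~%lld us on the host (the CPU/API yellow block)\n", us);
    return 0;
}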
The GPU yellow block represents the actual time period during which the kernel was executing on the GPU. The start and end of this yellow block are marked by the start and end of kernel activity on the GPU. The duration here is a function of what the code in your kernel is doing, and how long it takes.
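To approximate what the GPU yellow block reports without the profiler, you can bracket the launch with CUDA events. This fragment is a sketch that assumes the d_a/d_b/d_c, gridSize, and blockSize setup from your main(); cudaEventElapsedTime() returns the time the kernel actually spent executing on the device, which is unrelated to the much shorter host-side launch call.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                              // recorded on the GPU timeline
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                          // wait for the kernel and the stop event

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("vecAdd ran for %.3f ms on the GPU (the GPU yellow block)\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);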
I don't think the exact reason why a GPU kernel launch takes ~5-50 microseconds of CPU time is documented or explained anywhere in an authoritative fashion, and it is a closed source library, so you will need to acknowledge that overhead as something you have little control over. If you design kernels that run for a long time and do a lot of work, this overhead can become insignificant.