Tags: c, cuda, profiling, nvprof

nvprof not picking up any API calls or kernels


I'm trying to get benchmark timings for my CUDA program with nvprof, but it doesn't seem to be profiling any API calls or kernels. To make sure I was using it correctly, I looked for a simple beginner's example and found one on the NVIDIA developer blog:

https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

Code:

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    return 0;
}

Command line:

-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test

I replicated it word for word, line by line, and ran the same commands. Unfortunately, nvprof still reports nothing:

-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.

==85454== API calls:
No API activities were profiled. 

I am running CUDA Toolkit 7.5.

If anyone knows what I'm doing wrong, I'd be grateful for an answer.

-----EDIT-----

I modified the code to wrap everything in cudaProfilerStart() and cudaProfilerStop():

#include <cuda_profiler_api.h>

int main()
{
    cudaProfilerStart();
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaProfilerStop();
    return 0;
}

Unfortunately, that did not change the output.


Solution

  • This is a bug in unified memory profiling. Running nvprof with the flag

    nvprof --unified-memory-profiling off ./profile_test

    resolves the problem for me.
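For anyone else chasing the same symptom, it can help to first rule out a silently failing CUDA call, since the test program in the question ignores every return code. Below is a minimal sketch of the same program with the return codes checked. The helper name `report` is hypothetical (it is not in the original post or the blog example), and the final cudaDeviceReset() is there because NVIDIA's profiler documentation recommends it so buffered profile data is flushed before exit. This is only a diagnostic aid under those assumptions; it does not replace turning unified memory profiling off.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints a message and
// returns 1 if a CUDA runtime call failed, 0 otherwise.
static int report(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);

    int *h_a = (int*)malloc(bytes);
    if (h_a == NULL) return 1;
    memset(h_a, 0, bytes);

    int *d_a = NULL;
    // Stop issuing further CUDA calls as soon as one of them fails.
    int failed = report(cudaMalloc((void**)&d_a, bytes), "cudaMalloc");
    if (!failed)
        failed = report(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice),
                        "cudaMemcpy host-to-device");
    if (!failed)
        failed = report(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost),
                        "cudaMemcpy device-to-host");

    cudaFree(d_a);  // freeing a null pointer is a no-op
    free(h_a);

    // Tear down the context so the profiler can flush buffered data.
    failed |= report(cudaDeviceReset(), "cudaDeviceReset");
    return failed;
}
```

If this prints nothing to stderr and exits with status 0, the CUDA calls themselves are succeeding, and the empty nvprof output is a profiler problem and not a fault in the program.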