
Excessive profiler overhead with NVIDIA Visual Profiler


I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof):

[Image: nvvp timeline with 85 ms of overhead in 98 ms of total runtime]

The overall time is 98 ms, and 85 ms of that is "Instrumentation" in the first kernel launch.

How can I reduce this profiling overhead, or otherwise zoom in on just the part that I'm interested in?

Background

I am running this with "Start execution with profiling enabled" unchecked and I've limited the profiling using cudaProfilerStart/cudaProfilerStop like so:

/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();

where applyConvolution() is the function that I'm profiling.

I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.


Solution

  • As I was writing up this question, I thought I'd try messing around with the profiler settings to try and preempt some potential answer-in-comment material.

    To my surprise, disabling "Enable concurrent kernel profiling" got rid of the profiler overhead completely:

    But perhaps this shouldn't have been that much of a surprise:

    "Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead."

    (taken from http://docs.nvidia.com/cuda/profiler-users-guide/)

    An earlier version of the CUDA Profiler User's Guide also noted in a "Profiling Limitations" section that:

    "Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations."

    Oh well. Posting this question and answer anyway in case it helps someone else avoid this annoyance.
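
For anyone who wants to reproduce the setup, here is a minimal, self-contained sketch of the warm-up-then-profile pattern from the question. It is not the original code: `scaleKernel` and `runWorkload` are hypothetical stand-ins for `applyConvolution()`, and the `gpuErrchk` macro shown is one assumed definition, since the question does not include its own. Everything runs on the default stream, which is the single-stream case where the guide says concurrent kernel profiling can be deselected.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Assumed definition of gpuErrchk; the question does not show its own.
#define gpuErrchk(call)                                              \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical short-running kernel launched with many blocks.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical stand-in for applyConvolution(): several launches,
// all on the default stream, so no kernels can overlap.
static void runWorkload(float *d_data, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int k = 0; k < 10; ++k) {
        scaleKernel<<<blocks, threads>>>(d_data, n, 1.01f);
        gpuErrchk(cudaGetLastError());
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    gpuErrchk(cudaMalloc(&d_data, n * sizeof(float)));
    gpuErrchk(cudaMemset(d_data, 0, n * sizeof(float)));

    // Warm-up pass, outside the profiled region.
    runWorkload(d_data, n);
    gpuErrchk(cudaDeviceSynchronize());

    // Profiled pass: only this region is captured when the profiler
    // is started with profiling initially disabled.
    gpuErrchk(cudaProfilerStart());
    runWorkload(d_data, n);
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaProfilerStop());

    gpuErrchk(cudaFree(d_data));
    return 0;
}
```

After building with `nvcc`, the binary can be run under `nvprof --profile-from-start off`, which is the command-line counterpart of unchecking "Start execution with profiling enabled". nvprof also lists a `--concurrent-kernels off` option; I am assuming it is the closest equivalent of the nvvp checkbox, and I have not measured whether it removes the same overhead.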