I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof).
Overall time is 98 ms, and I'm getting 85 ms of "Instrumentation" in the first kernel launch.
How can I reduce this profiling overhead, or otherwise zoom in on just the part that I'm interested in?
I am running this with "Start execution with profiling enabled" unchecked, and I've limited the profiling using cudaProfilerStart like so:
/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();
where applyConvolution() is the function that I'm profiling.
I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.
As I was writing up this question, I thought I'd try messing around with the profiler settings to try and preempt some potential answer-in-comment material.
To my surprise, disabling "Enable concurrent kernel profiling" got rid of the profiler overhead completely.
But perhaps this shouldn't have been that much of a surprise:
Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead.
(taken from http://docs.nvidia.com/cuda/profiler-users-guide/)
An earlier version of the CUDA Profiler User's Guide also noted in a "Profiling Limitations" section that:
Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations.
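For anyone driving this from the command line rather than nvvp, here is a minimal sketch of the same setup: a warm-up call outside the profiled region, then a single call bracketed by cudaProfilerStart/cudaProfilerStop. The kernel scaleKernel and the CHECK macro are hypothetical stand-ins for applyConvolution and gpuErrchk, and the nvprof flags in the comment are what I believe correspond to the two nvvp checkboxes, so treat them as an assumption and check them against nvprof --help for your toolkit version.

```cuda
// Sketch only: scaleKernel and CHECK are hypothetical stand-ins, not the
// code from the question.
//
// Assumed build and run commands (verify the flags with `nvprof --help`):
//   nvcc -o profile_region profile_region.cu
//   nvprof --profile-from-start off --concurrent-kernels off ./profile_region
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Trivial kernel standing in for the real work being profiled.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    // Warm-up launch, outside the profiled region.
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Only the work between Start and Stop is collected when the
    // profiler is launched with profiling initially disabled.
    CHECK(cudaProfilerStart());
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaProfilerStop());

    CHECK(cudaFree(d_data));
    return 0;
}
```

Everything here runs on the default stream, which is the single-stream case the guide describes, so turning concurrent kernel profiling off should not hide anything.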
Oh well. Posting this question/answer anyways in case it helps someone else avoid this annoyance.