I am getting a lot of profiling overhead when trying to profile my code using nvvp (or with nvprof):
Overall time is 98 ms and I'm getting 85 ms of "Instrumentation" in the first kernel launch.
How can I reduce this profiling overhead or otherwise zoom-in on just the part that I'm interested in?
I am running this with "Start execution with profiling enabled" unchecked, and I've limited the profiling using cudaProfilerStart/cudaProfilerStop like so:
/* --- generate data etc --- */
// Call the function once to warm up the FFT plan cache
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
// Call it once for profiling
// (cudaProfilerStart/cudaProfilerStop are declared in cuda_profiler_api.h)
cudaProfilerStart();
applyConvolution( T, N, stride, plans, yData, phiW, fData, y_dwt );
gpuErrchk( cudaDeviceSynchronize() );
cudaProfilerStop();
where applyConvolution() is the function that I'm profiling.
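For completeness, here is a minimal, self-contained sketch of the same bracketing pattern. The dummyKernel, array size, and launch configuration are just placeholders (not my actual applyConvolution); the point is the cuda_profiler_api.h include and the start/stop calls around only the region of interest:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Placeholder kernel standing in for the real work
__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    gpuErrchk( cudaMalloc(&d, n * sizeof(float)) );

    // Warm-up launch, outside the profiled region
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    gpuErrchk( cudaDeviceSynchronize() );

    // Only this region is captured when profiling starts disabled
    gpuErrchk( cudaProfilerStart() );
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    gpuErrchk( cudaDeviceSynchronize() );
    gpuErrchk( cudaProfilerStop() );

    gpuErrchk( cudaFree(d) );
    return 0;
}

If you're driving nvprof from the command line instead of nvvp, the counterpart of unchecking "Start execution with profiling enabled" is --profile-from-start off, so that data collection only begins at cudaProfilerStart().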
I am using CUDA Toolkit 8.0 on Ubuntu 16.04 with a GTX 1080.
As I was writing up this question, I thought I'd experiment with the profiler settings to preempt some potential answer-in-comment material.
To my surprise, disabling "Enable concurrent kernel profiling" got rid of the profiler overhead completely.
But perhaps this shouldn't have been that much of a surprise:
Enable concurrent kernel profiling - This option should be selected for an application that uses CUDA streams to launch kernels that can execute concurrently. If the application uses only a single stream (and therefore cannot have concurrent kernel execution), deselecting this option may decrease profiling overhead.
(taken from http://docs.nvidia.com/cuda/profiler-users-guide/)
An earlier version of the CUDA Profiler User's Guide also noted in a "Profiling Limitations" section that:
Concurrent kernel mode can add significant overhead if used on kernels that execute a large number of blocks and that have short execution durations.
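For context, the pattern that option is aimed at looks roughly like the sketch below (kernelA/kernelB and the launch configuration are just placeholders): kernels launched into separate non-default streams so they can execute concurrently. If your application never launches kernels this way, the option only adds instrumentation cost.

#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernelB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void launchConcurrently(float *dA, float *dB, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launched into different non-default streams, so the kernels may overlap
    // on the GPU; tracking that overlap is what "Enable concurrent kernel
    // profiling" is for.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(dA, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(dB, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}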
Oh well. Posting this question/answer anyway in case it helps someone else avoid this annoyance.