What is the difference between 'GPU activities' and 'API calls' in the results of 'nvprof'?
I don't understand why there is a time difference for what looks like the same operation, for example [CUDA memcpy DtoH] versus cuMemcpyDtoH.
I need to measure timing in my code, but I don't know which of the two reported times is the right one to use.
Activities are actual usage of the GPU for some particular task.
An activity might be running a kernel, or it might be using GPU hardware to transfer data from Host to Device or vice-versa.
The duration of such an "activity" is the usual sense of duration: when did this activity start using the GPU, and when did it stop using the GPU.
API calls are calls made by your code (or by other CUDA API calls made by your code) into the CUDA driver or runtime libraries.
The two are related, of course. You perform an activity on the GPU by initiating it with some sort of API call. This is true both for data copying and for running kernels.
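To make the two categories concrete, here is a minimal runtime-API sketch (the kernel name `scale` and the array size are made up for illustration). Each marked line produces an entry under "API calls" (time spent in the library call on the host thread) and, where device work is involved, a corresponding entry under "GPU activities":

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel, used only to generate a kernel activity in the profile
__global__ void scale(float *d, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // API call only (no GPU activity row)
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // API call + [CUDA memcpy HtoD] activity
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // launch API call + kernel activity
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // API call + [CUDA memcpy DtoH] activity

    printf("h[0] = %f\n", h[0]);  // expect 2.0
    cudaFree(d);
    free(h);
    return 0;
}
```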
However, there can be a difference in "duration" or reported times. If I launch a kernel, for example, there may be many reasons (e.g. previous activity that is not yet complete in the same stream) why the kernel does not "immediately" begin executing. The kernel "launch" may be outstanding from an API perspective for a much longer time than the actual runtime duration of the kernel.
This applies to many other facets of API usage as well. For example, cudaDeviceSynchronize() can appear to take a very long time or a very short time, depending on what is happening (activities) on the device.
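Here is a minimal sketch of that host-side view, assuming a made-up spin kernel (`busy_kernel`) just to keep the GPU occupied: the launch call returns to the CPU thread almost immediately, while a subsequent cudaDeviceSynchronize() takes roughly as long as the kernel that is still running when it is called.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Hypothetical kernel that keeps the GPU busy for a while
__global__ void busy_kernel(float *d, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; k++) v = v * 1.0000001f + 0.0000001f;
        d[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    busy_kernel<<<(n + 255) / 256, 256>>>(d, n, 100000);  // asynchronous: returns right away
    auto t1 = clk::now();
    cudaDeviceSynchronize();                              // blocks until the kernel is done
    auto t2 = clk::now();

    double launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double sync_ms   = std::chrono::duration<double, std::milli>(t2 - t1).count();
    printf("kernel launch call    : %.3f ms (host side)\n", launch_ms);
    printf("cudaDeviceSynchronize : %.3f ms (~ kernel duration)\n", sync_ms);

    cudaFree(d);
    return 0;
}
```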
You may get a better sense of the difference between these two categories of reporting by studying the timeline in the NVIDIA visual profiler (nvvp).
Let's use your specific case as an example. This appears to be an app built on the CUDA driver API, and you apparently have a kernel launch followed immediately by (I would guess) a D->H memcpy operation:
multifrag_query_hoisted_kernels (kernel launch - about 479ms)
cuMemcpyDtoH (data copy D->H, about 20us)
In that situation, because CUDA kernel launches are asynchronous, the host code will launch the kernel and then proceed to the next code line, which is a cuMemcpyDtoH call. That is a blocking call: it causes the CPU thread to wait there until the previous CUDA activity is complete.
The activity portion of the profiler output tells us the kernel duration is around 479ms and the copy duration is around 20us (much, much shorter). From the standpoint of activity duration, these are the relevant times. However, as viewed from the host CPU thread, the time required to "launch" the kernel was much shorter than 479ms, and the time required to complete the call to cuMemcpyDtoH and proceed to the next line of code was much longer than 20us, because the thread had to wait at that library call until the previously issued kernel was complete. Both discrepancies are due to the asynchronous nature of CUDA kernel launches and the "blocking" or synchronous nature of cuMemcpyDtoH.
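For completeness, here is a short sketch that reproduces this pattern. It uses the runtime API rather than the driver API your app uses (the runtime sits on top of the driver, so the profile has the same shape), and the kernel `long_kernel` and the sizes are placeholders: nvprof would show a short [CUDA memcpy DtoH] activity but a long host-side time for the blocking copy call, because that call has to wait for the previously launched kernel.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Hypothetical long-running kernel standing in for the one in the question
__global__ void long_kernel(float *d, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; k++) v = v * 1.0000001f + 0.0000001f;
        d[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d;
    float *h = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    long_kernel<<<(n + 255) / 256, 256>>>(d, n, 200000);  // async launch: host moves on immediately
    auto t1 = clk::now();
    // Blocking copy: the host thread waits here until the kernel finishes,
    // then the (short) DtoH transfer runs. The profiler shows a short
    // [CUDA memcpy DtoH] activity but a long copy API call.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    auto t2 = clk::now();

    double launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double copy_ms   = std::chrono::duration<double, std::milli>(t2 - t1).count();
    printf("kernel launch call : %.3f ms (host side)\n", launch_ms);
    printf("blocking memcpy    : %.3f ms (includes waiting for the kernel)\n", copy_ms);

    cudaFree(d);
    free(h);
    return 0;
}
```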