I thought I had the grasp of this but apparently I do not:) I need to perform parallel H.264 stream encoding with NVENC from frames that are not in any of the formats accepted by the encoder so I have a following code pipeline:
cuMemcpy
is synchronous, so I can return from the callback, all pending operations are pushed in a dedicated stream)For some reason I had the assumption that I need a dedicated context for each thread if I perform this pipeline in parallel threads. The code was slow and after some reading I understood that the context switching is actually expensive, and then I actually came to the conclusion that it makes no sense since in a context owns the whole GPU so I lock out any parallel processing from other transcoder threads.
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?
You should be fine with a single context.
Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?
The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process' use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.
If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.
And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.
CUDA runtime API usage manages contexts for you. You normally don't explicitly interact with a CUDA context when using the runtime API. However, in driver API usage, the context is explicitly created and managed.