This question is a follow-up to Jason R's comment on Robert Crovella's answer to this original question ("Multiple CUDA contexts for one device - any sense?"):
When you say that multiple contexts cannot run concurrently, is this limited to kernel launches only, or does it refer to memory transfers as well? I have been considering a multiprocess design all on the same GPU that uses the IPC API to transfer buffers from process to process. Does this mean that effectively, only one process at a time has exclusive access to the entire GPU (not just particular SMs)? [...] How does that interplay with asynchronously-queued kernels/copies on streams in each process as far as scheduling goes?
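To make the design concrete, here is a rough sketch of the kind of IPC buffer sharing I have in mind (error handling omitted; `myKernel`, `N`, `grid`, `block`, and `stream` are placeholders, and the handle transport between processes is left to ordinary host-side IPC):

```cuda
// Process A: allocate device memory and export an IPC handle for it.
cudaIpcMemHandle_t handle;
float *d_buf;
cudaMalloc(&d_buf, N * sizeof(float));
cudaIpcGetMemHandle(&handle, d_buf);
// ... send `handle` to process B via a pipe, socket, or shared memory ...

// Process B: map the same allocation into its own address space,
// then queue work on it in its own stream.
float *d_peer;
cudaIpcOpenMemHandle((void **)&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
myKernel<<<grid, block, 0, stream>>>(d_peer, N);  // hypothetical kernel
cudaStreamSynchronize(stream);
cudaIpcCloseMemHandle(d_peer);
```

The scheduling question above is about what happens when several processes, each structured like process B, have kernels and async copies queued on their own streams at the same time.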
Robert Crovella suggested asking this in a new question, but that never happened, so let me do it here.
For reference: the Multi-Process Service (MPS) is an alternative CUDA implementation by Nvidia that makes multiple processes share a single context. This allows, for example, kernels from different processes to run concurrently when no single one of them fills the entire GPU by itself.
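As I understand it, MPS is enabled by starting the control daemon before launching the client processes, roughly like this (a sketch; the pipe and log directories are arbitrary example paths, and this obviously requires an Nvidia driver on the machine):

```shell
# Select the GPU and choose pipe/log directories for the MPS daemon
# (the directory paths here are examples, not required defaults).
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log

# Start the MPS control daemon; CUDA processes started afterwards
# with the same environment connect to it instead of creating
# their own contexts directly on the device.
nvidia-cuda-mps-control -d

# ... run the CUDA client processes here ...

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control
```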