Is it possible to share a cudaMalloc'ed GPU buffer between different contexts (CPU threads) that use the same GPU? Each context allocates an input buffer, which needs to be filled by a pre-processing kernel that uses the entire GPU and then distributes the output to the contexts.
This scenario is ideal for avoiding multiple data transfers to and from the GPU. The application is a beamformer, which combines multiple antenna signals and generates multiple beams, where each beam is processed by a different GPU context. The entire processing pipeline for the beams is already in place; I just need to add the beamforming part. Having each thread generate its own beam would duplicate the input data, so I'd like to avoid that (also, it's much more efficient to generate multiple beams in one go).
Each CUDA context has its own virtual memory space, so you cannot use a pointer from one context inside another context.
That being said, as of CUDA 4.0 the runtime API by default creates one context per process (per device), not one per thread. If you have multiple threads running with the same CUDA context, sharing device pointers between threads should work without problems.
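To make the single-context case concrete, here is a minimal sketch, assuming the runtime API (CUDA 4.0 or later), a single GPU at device 0, and one host process. The kernels `fill_beams` and `process_beam` and the helper `run_shared_buffer_demo` are hypothetical stand-ins for the beamformer and the per-beam pipeline, not anything from your code. One thread allocates and fills a single buffer holding all beams, then each worker thread launches its own kernel on its slice of that same device pointer.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the beamforming kernel: fills every beam in one launch.
// Each sample is set to its beam index so the result is easy to check.
__global__ void fill_beams(float* out, int n_beams, int samples_per_beam) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_beams * samples_per_beam) {
        out[i] = static_cast<float>(i / samples_per_beam);
    }
}

// Hypothetical stand-in for one stage of the per-beam pipeline: scales a beam in place.
__global__ void process_beam(float* beam, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) beam[i] *= gain;
}

// Returns true if every worker thread could use the shared device pointer.
bool run_shared_buffer_demo(int n_beams, int samples_per_beam) {
    const int total = n_beams * samples_per_beam;
    const size_t bytes = static_cast<size_t>(total) * sizeof(float);

    // One allocation, made by the main thread, holds the output for all beams.
    float* d_beams = nullptr;
    if (cudaMalloc(&d_beams, bytes) != cudaSuccess) return false;

    fill_beams<<<(total + 255) / 256, 256>>>(d_beams, n_beams, samples_per_beam);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(d_beams); return false; }

    // Each host thread works on its own slice of the same allocation.
    // All threads share the device's context, so the pointer is valid in each.
    std::vector<std::thread> workers;
    for (int b = 0; b < n_beams; ++b) {
        workers.emplace_back([=] {
            cudaSetDevice(0);  // same device as the allocating thread (assumed device 0)
            float* my_beam = d_beams + static_cast<size_t>(b) * samples_per_beam;
            process_beam<<<(samples_per_beam + 255) / 256, 256>>>(my_beam, samples_per_beam, 2.0f);
            cudaDeviceSynchronize();
        });
    }
    for (auto& t : workers) t.join();

    // Copy back and verify: beam b should now contain 2 * b everywhere.
    std::vector<float> h(total);
    bool ok = cudaMemcpy(h.data(), d_beams, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_beams);
    for (int i = 0; ok && i < total; ++i) {
        ok = (h[i] == 2.0f * static_cast<float>(i / samples_per_beam));
    }
    return ok;
}

int main() {
    std::printf("%s\n", run_shared_buffer_demo(4, 1024) ? "ok" : "failed");
    return 0;
}
```

Built with something like `nvcc -std=c++11 demo.cu` (the file name is arbitrary), this only covers threads within one process. If your existing pipeline creates a separate context per thread through the driver API, the pointer arithmetic above would not carry over as is.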