As we know, Fermi supports only a single connection to the GPU, and as stated here: http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
the Fermi architecture can simultaneously support
up to 16 CUDA kernels on the GPU.
We also know that Hyper-Q allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
But how many kernels can execute simultaneously on Kepler CC 3.0/3.5: 16 or 32 (streams)?
From the programming guide:
The maximum number of kernel launches that a device can execute concurrently is 32 on devices of compute capability 3.5 and 16 on devices of lower compute capability.
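As a rough illustration, the quoted rule can be written as a small lookup. This is only a sketch: `max_concurrent_kernels` is a hypothetical helper, not part of the CUDA runtime API, and it deliberately covers only the compute capabilities discussed here (2.x, 3.0 and 3.5), returning -1 for anything else.

```cpp
// Hypothetical helper (an assumption for illustration, not a CUDA API call).
// It encodes only the sentence quoted from the programming guide above and
// only for the architectures discussed in this question.
constexpr int max_concurrent_kernels(int major, int minor) {
    // Compute capability 3.5 (e.g. GK110): 32 concurrent kernel launches.
    if (major == 3 && minor == 5) return 32;
    // Fermi (2.x) and Kepler CC 3.0: 16 concurrent kernel launches.
    if (major == 2 || (major == 3 && minor == 0)) return 16;
    // Anything else is outside what the quoted sentence is used for here.
    return -1;
}

// Compile-time checks of the three cases asked about.
static_assert(max_concurrent_kernels(2, 0) == 16, "Fermi");
static_assert(max_concurrent_kernels(3, 0) == 16, "Kepler CC 3.0");
static_assert(max_concurrent_kernels(3, 5) == 32, "Kepler CC 3.5");
```

In a real program the `major` and `minor` values would come from the `cudaDeviceProp` structure filled in by `cudaGetDeviceProperties`, whose `concurrentKernels` field also reports whether the device can run kernels concurrently at all.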