I'm trying to implement an efficient way of doing concurrent inference in PyTorch.
Right now, I start 2 processes on my GPU (I have only 1 GPU, so both processes are on the same device). Each process loads my PyTorch model and runs the inference step.
My problem is that my model takes up quite a lot of GPU memory. I have 12 GB of memory on the GPU, and the model alone takes ~3 GB (without the data), which means that together my 2 processes use 6 GB of memory just for the model.
Now I was wondering if it's possible to load the model only once and use it for inference from 2 different processes. What I want is for the model to consume only 3 GB of memory while still having 2 processes.
I came across this answer mentioning IPC, but as far as I understand, it means that process #2 will copy the model from process #1, so I will still end up with 6 GB allocated for the model.
I also checked the PyTorch documentation on DataParallel and DistributedDataParallel, but it does not seem possible with them.
This seems to be what I want, but I couldn't find any code example showing how to use it with PyTorch in inference mode.
I understand that this might be difficult to do for training, but please note that I'm only talking about the inference step (the model is read-only, with no need to update gradients). Under this assumption, I'm not sure whether it's possible or not.
The GPU itself has many threads. When performing an array/tensor operation, it uses each thread on one or more cells of the array. This is why it seems that an op that can fully utilize the GPU should scale efficiently without multiple processes -- a single GPU kernel is already massively parallelized.
In a comment you mentioned seeing better results with multiple processes in a small benchmark. I'd suggest running the benchmark with more jobs to ensure warmup; ten kernels seems like too small a test. If you find that a thorough, representative benchmark consistently runs faster, though, I'll trust good benchmarks over my intuition.
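As a rough illustration of what a benchmark with warmup could look like, here is a minimal timing helper. The name `benchmark` and its parameters are made up for this sketch; `sync` is an optional callback because CUDA kernel launches are asynchronous, so for GPU work you would likely want to pass something like `torch.cuda.synchronize` there so the timer only stops once the queued kernels have finished.

```python
import time


def benchmark(fn, warmup=20, iters=200, sync=None):
    """Return the average wall-clock seconds per call of fn.

    fn     -- zero-argument callable that runs one inference step (hypothetical)
    warmup -- untimed calls, so one-off setup costs are not measured
    iters  -- timed calls
    sync   -- optional callable that blocks until queued GPU work is done,
              e.g. torch.cuda.synchronize (assumption: the work runs on CUDA)
    """
    # Untimed warmup calls.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to actually complete
    elapsed = time.perf_counter() - start
    return elapsed / iters
```

Running the same helper for the one-process and two-process setups, with a few hundred iterations each, should give a fairer comparison than ten kernels.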
My understanding is that kernels launched on the default CUDA stream get executed sequentially. If you want them to run in parallel, I think you'd need multiple streams. Looking in the PyTorch code, I see code like getCurrentCUDAStream() in the kernels, which makes me think the GPU will still run any PyTorch code from all processes sequentially.
This NVIDIA discussion suggests this is correct:
https://devtalk.nvidia.com/default/topic/1028054/how-to-launch-cuda-kernel-in-different-processes/
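For reference, the multi-stream idea mentioned above can be sketched inside a single process roughly like this. The function name `run_on_streams` and the use of a plain matmul in place of a real model are assumptions for illustration. Putting work on separate streams only allows the GPU to overlap kernels; whether it actually does depends on how much of the device each kernel occupies.

```python
import torch


def run_on_streams(inputs, weight):
    """Run one matmul per input, each on its own CUDA stream (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No GPU available: fall back to plain sequential CPU execution.
        return [x @ weight for x in inputs]

    device = torch.device("cuda")
    weight = weight.to(device)
    inputs = [x.to(device) for x in inputs]

    # One side stream per input; kernels on different streams may overlap.
    streams = [torch.cuda.Stream() for _ in inputs]
    outputs = []
    for x, stream in zip(inputs, streams):
        # The inputs were copied on the current stream, so wait for that first.
        stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(stream):
            outputs.append(x @ weight)

    # Block until every stream has finished before reading the results.
    torch.cuda.synchronize()
    return [out.cpu() for out in outputs]
```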
Newer GPUs may be able to run multiple kernels in parallel (using MPS?), but it seems like this is just implemented with time slicing under the hood anyway, so I'm not sure we should expect higher total throughput:
How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?
If you do need to share memory from one model across two parallel inference calls, can you just use multiple threads instead of processes, and refer to the same model from both threads?
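A minimal sketch of that thread-based approach, assuming a small stand-in network instead of your real 3 GB model (the layer sizes, the `worker` function and the `results` dict are all made up for illustration). The weights are loaded onto the device once, and both threads call the same model object:

```python
import threading

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the real model: created once, shared by every thread.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(device)
model.eval()  # inference only

results = {}


def worker(name, batch):
    # Grad mode is thread-local in PyTorch, so each thread enters no_grad itself.
    with torch.no_grad():
        results[name] = model(batch.to(device)).cpu()


batches = {"a": torch.randn(8, 16), "b": torch.randn(8, 16)}
threads = [
    threading.Thread(target=worker, args=(name, batch))
    for name, batch in batches.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since the forward pass only reads the weights, the two threads don't need a lock around the model here. The threads still share one Python interpreter, so any heavy CPU-side pre- or post-processing would be serialized by the GIL.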
To actually get the GPU to run multiple kernels in parallel, you may be able to use nn.Parallel in PyTorch. See the discussion here: https://discuss.pytorch.org/t/how-can-l-run-two-blocks-in-parallel/61618/3