I am running inference on a Hugging Face model using FastAPI and Uvicorn.
The code looks roughly like this:
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
pipe = pipeline("text-generation", model="EleutherAI/gpt-neo-125m", device=0)  # one copy per worker

@app.post("/inference")
async def func(text: str):
    output = pipe(text)
    return {"output": output}
I start the server like this:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 4
The server has enough GPU memory (80 GB).
What I expect to happen is that each of the 4 workers gets its own GPU memory space, and there are 4 CPU forks of the main process, one for each worker. I can check the GPU memory allocation using nvidia-smi. So there should be 4 CPU forks and 4 processes on the GPU.
This ^ happens like clockwork when I use a smaller model (like GPT Neo 125m).
But when I use a larger model (like GPT-J in 16-bit), the behavior is often unpredictable. Sometimes there are 4 CPU forks but only 3 processes on the GPU, even though there is enough free memory left over. Sometimes there is only 1 process on the GPU and 4 CPU forks.
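For reference, this is roughly how I compare the GPU-side process count with the worker count (just a quick sketch; it assumes nvidia-smi is on the PATH and that the Uvicorn workers are the only compute processes on the GPU):

import subprocess

def count_gpu_processes() -> int:
    # One line per compute process currently holding memory on the GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_gpu_memory", "--format=csv,noheader"],
        text=True,
    )
    return len([line for line in out.splitlines() if line.strip()])

expected = 4  # --workers 4
print(f"GPU compute processes: {count_gpu_processes()} (expected {expected})")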
What could be causing this and how do I diagnose further?
When using multiple workers, each worker gets its own copy of the model on the GPU. Loading the model onto the GPU is a memory-intensive task, and loading N copies at once frequently leads to worker timeout errors. These errors can be seen in the output of dmesg.
Uvicorn doesn't have very good support for worker management: when a worker times out, it doesn't keep trying to reload it. As a result, fewer copies of the model than the number of workers often end up actually loaded on the GPU.
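One way to confirm which workers actually made it onto the GPU is to log, from inside each worker, its PID and the GPU memory it has allocated once the model is loaded; a worker that timed out never logs the line. A minimal sketch, assuming a PyTorch backend (the hook is added to the existing app instance; names are illustrative):

import os
import torch

@app.on_event("startup")
async def log_worker_gpu_usage():
    # Runs once per worker process ("app" is the FastAPI instance defined above);
    # the PIDs printed here should line up with the compute processes in nvidia-smi.
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"worker pid={os.getpid()} has {allocated_gb:.2f} GB allocated on the GPU")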
The timeout errors are reported explicitly when Gunicorn is used. Using Gunicorn with 1) Uvicorn workers (because FastAPI is async) and 2) a high value for the --timeout option takes care of the problem.
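As a rough sketch (the file name and numbers are example values to tune, not settings taken from the question), the relevant options can live in a gunicorn.conf.py:

# gunicorn.conf.py -- example values; give timeout enough headroom for model loading
bind = "0.0.0.0:8080"
workers = 4
worker_class = "uvicorn.workers.UvicornWorker"  # async workers, so FastAPI stays async
timeout = 300  # seconds; the default (30) is far too short for loading a large model onto the GPU

The server is then started with gunicorn -c gunicorn.conf.py app:app. Gunicorn logs a worker timeout explicitly and restarts the worker, and with a generous timeout all 4 workers get enough time to finish loading their copy of the model.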