Tags: python, flask, gpu, gunicorn, onnxruntime

Python - Gunicorn crashes on GPU inference


I'm building a local API for macOS that runs inference on an ONNX model (MiDaS) in Python. I use onnxruntime-silicon (a fork of onnxruntime) to run the model on the Apple Silicon GPU, and Flask for the server side. I managed to get my script working with the Flask development server, but I can't get Gunicorn to work for this task.

Using Flask development server (working)

Here is a working Python 3 script (using the development server):

# Server libraries
from flask import Flask
# NN and Image processing libraries
import onnxruntime as ort
from cv2 import imread, imwrite, cvtColor, COLOR_BGR2RGB
import numpy as np

app = Flask(__name__)

# Use a GPU provider if available
providers = ort.get_available_providers()

# Load ONNX model
sess = ort.InferenceSession("models/model-f6b98070.onnx", providers=providers)

def postprocess(depth_map):

    '''Process and save the depth map as a JPG'''

    # Rescale to 0-255, convert to uint8 and save the image
    rescaled = (255.0 / depth_map[0].max() * (depth_map[0] - depth_map[0].min())).astype(np.uint8)
    rescaled = np.squeeze(rescaled)
    imwrite('tmp/depth.jpg', rescaled)

def preprocess(image='tmp/frame.jpg'):

    '''Load and process the image for the model'''

    input_image = imread(image) # Load image with OpenCV (384x384 only!)
    input_image = cvtColor(input_image, COLOR_BGR2RGB) # Convert to RGB
    input_array = np.transpose(input_image, (2,0,1)) # Reshape (H,W,C) to (C,H,W)
    input_array = np.expand_dims(input_array, 0) # Add the batch dimension B
    normalized_input_array = input_array.astype('float32') / 255 # Normalize
    return normalized_input_array

@app.route('/predict', methods=['POST'])
def predict():
    # Load input image
    input_array = preprocess()
    # Process inference
    input_name = sess.get_inputs()[0].name
    results = sess.run(None, {input_name: input_array})
    # Save depth map
    postprocess(results)
    return 'DONE'

if __name__ == '__main__':
    app.run(debug=True)

I can make a request like so:

import requests
response = requests.post('http://127.0.0.1:5000/predict')
print(response.status_code)

The depth map is saved as a JPG, and everything works as expected.

Using Gunicorn server (not working)

Now I want to switch from the default Flask server (meant for development) to Gunicorn (a production WSGI server). Starting from the first script, I import the following libraries:

from gunicorn.app.base import BaseApplication
import gunicorn.glogging
import gunicorn.workers.sync

I create a Gunicorn class:

class GunicornApplication(BaseApplication):
    def __init__(self, app, options=None):
        self.application = app
        self.options = options or {}
        super().__init__()

    def load_config(self):
        for key, value in self.options.items():
            if key in self.cfg.settings and value is not None:
                self.cfg.set(key.lower(), value)

    def load(self):
        return self.application

And initialize the script like so:

if __name__ == '__main__':
    options = {'bind': '127.0.0.1:5000', 'workers': 1}
    GunicornApplication(app, options).run()

The server launches without issue, but Python crashes when I make an inference request, and the script raises the following error:

>>> [ERROR] Worker (pid:10517) was sent SIGSEGV!

I get the following exception from my request:

>>> requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I know the error is caused by the GPU, since the Gunicorn script works fine when the provider is set to CPU. Maybe the issue is linked to bad communication between the script and the Gunicorn worker processes?
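
For reference, this is how I force the CPU provider to confirm that the GPU path is what crashes (a minimal sketch; CPUExecutionProvider is the standard onnxruntime CPU provider name):

# Pin the session to the CPU execution provider to rule out the GPU
sess = ort.InferenceSession("models/model-f6b98070.onnx",
                            providers=["CPUExecutionProvider"])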

Any help or suggestion is welcome!


Solution

  • From what I understand, each worker needs to load the model into its own memory space, so I decided to load the model properly for each worker with a load_model function and the Gunicorn post_worker_init hook:

    sess = None
    
    def load_model(_):
        global sess
        providers = ort.get_available_providers()
        sess = ort.InferenceSession("/opt/FCPX Studio/Utils/depth map/models/model-f6b98070.onnx", providers=providers)
    
    class GunicornApplication(BaseApplication):
    
        [...]
    
        def load_config(self):
            for key, value in self.options.items():
                if key in self.cfg.settings and value is not None:
                    self.cfg.set(key.lower(), value)
    
            # Set up the post_worker_init hook to load the model.
            self.cfg.set('post_worker_init', load_model)
    
        [...]
    

    This didn't fix the issue, but it allowed me to receive the following error from the server on inference requests:

    objc[51435]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
    

    The fork safety check blocks the process and crashes Python. This answer as well as this thread explain the issue quite well. I'm still not quite sure what exactly caused the fork safety check to block the process.

    In the meantime, I can disable the fork safety check with an environment variable like so:

    export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
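
    The variable can presumably also be set from inside the script instead of the shell. This is an untested sketch on my part; it likely has to run before any import that loads the Objective-C runtime (OpenCV, onnxruntime), since the variable is read when that runtime initializes, and before Gunicorn forks its workers:

    # Untested sketch: set the variable at the very top of the script,
    # before importing cv2 / onnxruntime, so it is already in the
    # environment when the Objective-C runtime initializes and before
    # Gunicorn forks its workers.
    import os
    os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'] = 'YES'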