I'm creating a local API for macOS that runs inferences on an ONNX model (MiDaS) in Python. I use onnxruntime-silicon (a fork of onnxruntime) to run the model on the Apple Silicon GPU, and Flask for the server side. I managed to get my script working with the Flask development server, but I can't make Gunicorn work for this task.
Here is a working Python 3 script (using the development server):
# Server libraries
from flask import Flask
# NN and Image processing libraries
import onnxruntime as ort
from cv2 import imread, imwrite, cvtColor, COLOR_BGR2RGB
import numpy as np
app = Flask(__name__)
# Use a GPU provider if available
providers = ort.get_available_providers()
# Load ONNX model
sess = ort.InferenceSession("models/model-f6b98070.onnx", providers=providers)
def postprocess(depth_map):
    '''Process and save the depth map as a JPG'''
    # Rescale to 0-255, convert to uint8 and save the image
    rescaled = (255.0 / depth_map[0].max() * (depth_map[0] - depth_map[0].min())).astype(np.uint8)
    rescaled = np.squeeze(rescaled)
    imwrite('tmp/depth.jpg', rescaled)

def preprocess(image='tmp/frame.jpg'):
    '''Load and process the image for the model'''
    input_image = imread(image)                         # Load image with OpenCV (384x384 only!)
    input_image = cvtColor(input_image, COLOR_BGR2RGB)  # Convert to RGB
    input_array = np.transpose(input_image, (2, 0, 1))  # Reshape (H,W,C) to (C,H,W)
    input_array = np.expand_dims(input_array, 0)        # Add the batch dimension B
    normalized_input_array = input_array.astype('float32') / 255  # Normalize
    return normalized_input_array

@app.route('/predict', methods=['POST'])
def predict():
    # Load input image
    input_array = preprocess()
    # Process inference
    input_name = sess.get_inputs()[0].name
    results = sess.run(None, {input_name: input_array})
    # Save depth map
    postprocess(results)
    return 'DONE'

if __name__ == '__main__':
    app.run(debug=True)
I can make a request like so:
import requests
response = requests.post('http://127.0.0.1:5000/predict')
print(response.status_code)
The depth map is saved as JPG, everything works as expected.
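For completeness, here is how I can check which providers are available and which one the session actually picked (standard onnxruntime calls; with onnxruntime-silicon the available list should include CoreMLExecutionProvider):
# Run inside the server script, after the InferenceSession is created
print(ort.get_available_providers())  # providers this onnxruntime build supports
print(sess.get_providers())           # providers the session actually uses, in priority order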
Now I want to switch from the default Flask server (made for development) to Gunicorn (a production WSGI server). Starting from the first script, I import the following libraries:
from gunicorn.app.base import BaseApplication
import gunicorn.glogging
import gunicorn.workers.sync
I create a Gunicorn class:
class GunicornApplication(BaseApplication):

    def __init__(self, app, options=None):
        self.application = app
        self.options = options or {}
        super().__init__()

    def load_config(self):
        for key, value in self.options.items():
            if key in self.cfg.settings and value is not None:
                self.cfg.set(key.lower(), value)

    def load(self):
        return self.application
And initialize the script like so:
if __name__ == '__main__':
    options = {'bind': '127.0.0.1:5000', 'workers': 1}
    GunicornApplication(app, options).run()
The server launches without issue, but Python crashes when I make an inference request, and the script raises the following error:
>>> [ERROR] Worker (pid:10517) was sent SIGSEGV!
I get the following exception from my request:
>>> requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I know the error is caused by the GPU, since the Gunicorn script works well if the provider is set to CPU. Maybe the issue is linked to bad communication between the script and the Gunicorn server threads?
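For that CPU test, I simply pin the session to the CPU provider instead of passing the full provider list (same onnxruntime API as in the first script):
# Force CPU-only inference: skip CoreML and use only the CPU execution provider
sess = ort.InferenceSession("models/model-f6b98070.onnx", providers=['CPUExecutionProvider'])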
Any help or suggestion is welcome!
From what I understand, each worker needs to load the model in its own memory, so I decided to load the model for each worker with a load_model function and the Gunicorn post_worker_init hook:
sess = None

def load_model(_):
    global sess
    providers = ort.get_available_providers()
    sess = ort.InferenceSession("/opt/FCPX Studio/Utils/depth map/models/model-f6b98070.onnx", providers=providers)
class GunicornApplication(BaseApplication):

    [...]

    def load_config(self):
        for key, value in self.options.items():
            if key in self.cfg.settings and value is not None:
                self.cfg.set(key.lower(), value)
        # Set up the post_worker_init hook to load the model.
        self.cfg.set('post_worker_init', load_model)

    [...]
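As far as I can tell, the hook could also be passed through the options dict instead of being hard-coded in load_config, since load_config copies every known setting from it, for example:
options = {'bind': '127.0.0.1:5000', 'workers': 1, 'post_worker_init': load_model}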
This didn't fix the issue but allowed me to receive the following error from the server on the inference requests:
objc[51435]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
The fork safety check blocks the process and crashes Python. This answer as well as this thread explain the issue quite well. I'm still not quite sure what exactly caused the fork safety check to block the process.
In the meantime, I can disable the fork safety check with an environment variable like so:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
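and then launch the script from that same shell, or prefix the command directly (server.py standing in for whatever the script file is called):
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES python3 server.py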