I am trying to create a Flask app that should:
I made a quick test, ran it with the Flask development server, and it worked as expected, though I was scared off by the red warning: `WARNING: This is a development server. Do not use it in a production deployment.`
So I tried putting the app behind a WSGI server, but both Waitress and Gunicorn gave much slower results. Tests (on a toy problem with artificial input, tiny output, and fully reproducible code) are below.
I've put these three files in a folder:
basic_flask_app.py (this is supposed to do very little with the data it gets; the real code is a deep learning model that runs quite fast on GPU, but this example is constructed to make the issue more extreme)
```python
import numpy as np
from flask import Flask, request

from do_request import IS_SMALL_DATA, WIDTH, HEIGHT

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    numpy_bytes = np.frombuffer(request.data, np.float32)
    if IS_SMALL_DATA:
        numpy_image = np.zeros((HEIGHT, WIDTH)) + numpy_bytes
    else:
        numpy_image = numpy_bytes.reshape(HEIGHT, WIDTH)
    result = numpy_image.mean(axis=1).std(axis=0)
    return result.tobytes()

if __name__ == '__main__':
    app.run(host='localhost', port=80, threaded=False, processes=1)
```
[Edited: the original version of this question was missing the parameters `threaded=False, processes=1` in the call to `app.run` above, so the behaviour was not comparable to Gunicorn and Waitress below, which are forced to a single thread/process. I've added them now and re-tested; the results don't change: the Flask server is still fast after this change, if anything faster.]
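As a sanity check of what the handler computes (my own sketch, independent of any server; it reuses the same constants and synthetic input as the client below):

```python
import numpy as np

# Same constants as in do_request.py
WIDTH, HEIGHT = 2500, 3000
n = WIDTH * HEIGHT

# Same synthetic input the client sends in the large-data case
np_image = (np.arange(n).astype(np.float32) / np.float32(n)).reshape(HEIGHT, WIDTH)

# Same computation the handler performs: std of the per-row means
result = np_image.mean(axis=1).std(axis=0)
print(result, len(result.tobytes()))  # a float32 scalar, 4 bytes on the wire
```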
do_request.py
```python
import requests
import numpy as np
from tqdm import trange

WIDTH = 2500
HEIGHT = 3000
IS_SMALL_DATA = False

def main(url='http://127.0.0.1:80/predict'):
    n = WIDTH * HEIGHT
    if IS_SMALL_DATA:
        np_image = np.zeros(1, dtype=np.float32)
    else:
        np_image = np.arange(n).astype(np.float32) / np.float32(n)
    results = []
    for _ in trange(50):
        results.append(requests.post(url, data=np_image.tobytes()))

if __name__ == '__main__':
    main()
```
waitress_server.py
```python
from waitress import serve

import basic_flask_app

serve(basic_flask_app.app, host='127.0.0.1', port=80, threads=1)
```
I've run the tests by running `python do_request.py` after starting the server with one of the following three commands:

```
python basic_flask_app.py
python waitress_server.py
gunicorn -w 1 basic_flask_app:app -b 127.0.0.1:80
```
With these three options, and toggling the `IS_SMALL_DATA` flag (if `True`, only 4 bytes of data are transmitted; if `False`, 30 MB), I got the following timings:
| 50 requests | Flask | Waitress | Gunicorn |
|---|---|---|---|
| 30 MB input, 4 B output | 00:01 (28.6 it/s) | 00:11 (4.42 it/s) | 00:11 (4.26 it/s) |
| 4 B input, 4 B output | 00:01 (25.2 it/s) | 00:02 (23.6 it/s) | 00:01 (26.4 it/s) |
As you can see, the Flask development server is very fast regardless of the amount of data transmitted (the "small" data case is even slightly slower, probably because it wastes time allocating the memory on each of the 50 iterations), while both Waitress and Gunicorn take a significant speed hit as more data is transmitted.
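One way to narrow down where the time goes (a sketch of my own, not part of the original test setup; the hard-coded 3000 × 2500 shape matches the large-data case above) is to time the body read separately from the numpy work inside the handler:

```python
import time

import numpy as np
from flask import Flask, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    t0 = time.perf_counter()
    raw = request.data  # forces the WSGI server to read the request body
    t1 = time.perf_counter()
    numpy_bytes = np.frombuffer(raw, np.float32)
    result = numpy_bytes.reshape(3000, 2500).mean(axis=1).std(axis=0)
    t2 = time.perf_counter()
    print(f'body read: {t1 - t0:.4f}s, numpy work: {t2 - t1:.4f}s')
    return result.tobytes()
```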
At this point, I have a couple of questions:
This is interesting; maybe it explains the question. Using `time.time()` I found that reading `request.data` in the web app takes a very different amount of time depending on the server. With gunicorn it accounts for more than 95% of the request time, about 0.35 s; with the Flask dev server it takes about 0.001 s.
I stepped into the packages and found that most of the time is spent on line 456 of `werkzeug/wrappers/base_request.py`, which is

```python
rv = self.stream.read()
```

With the Flask dev server, `self.stream` is a `werkzeug.wsgi.LimitedStream`, and this line takes about 0.001 s.
With gunicorn, `self.stream` is a `gunicorn.http.body.Body`, and the same line takes more than 0.3 s.
I stepped into `gunicorn/http/body.py`. Lines 214-218 are

```python
while size > self.buf.tell():
    data = self.reader.read(1024)
    if not data:
        break
    self.buf.write(data)
```

and this loop takes more than 0.3 s.
I tried changing the code above to `self.buf.write(self.reader.read(size))`, which brings the cost down to 0.07 s.
I then split that change into separate timed steps:

```python
now = time.time()
buffer = self.reader.read(size)
print(time.time() - now)
now = time.time()
self.buf.write(buffer)
print(time.time() - now)
```

The read takes 0.053 s and the write takes 0.017 s.
I think I've found the reason.
First, gunicorn wraps the raw bytes in its own object built on `io.BytesIO`.
Second, gunicorn reads the bytes in a `while` loop of small chunks, which costs more time.
I guess the purpose of this code is to support high concurrency.
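To see how much the small-chunk loop itself costs at the Python level, here is a standalone micro-benchmark (my own sketch, not gunicorn's exact code; it uses a pure in-memory stream, so it isolates only the loop overhead, not socket latency):

```python
import io
import time

# A 30 MB dummy payload, roughly the size of the large request body above
PAYLOAD = b'x' * (30 * 1024 * 1024)

def read_chunked(stream, chunk_size=1024):
    # Mimics gunicorn's approach: many small reads appended to a BytesIO
    buf = io.BytesIO()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        buf.write(data)
    return buf.getvalue()

def read_once(stream):
    # Single read(), like the fast LimitedStream path
    return stream.read()

for name, fn in [('chunked 1024 B reads', read_chunked), ('single read', read_once)]:
    stream = io.BytesIO(PAYLOAD)
    start = time.perf_counter()
    data = fn(stream)
    elapsed = time.perf_counter() - start
    assert data == PAYLOAD
    print(f'{name}: {elapsed:.4f}s')
```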
In your case, I think you can just use gevent directly:

```python
from gevent.pywsgi import WSGIServer

from basic_flask_app import app

http_server = WSGIServer(('', 80), app)
http_server.serve_forever()
```

This is much faster.