I am trying to create a Flask app that should:
I made a quick test, ran it with the Flask development server, and it worked as expected, though I was scared off by the red warning: `WARNING: This is a development server. Do not use it in a production deployment.`
So I tried putting the app behind a WSGI server, but both Waitress and Gunicorn gave much slower results. Tests (on a toy problem with artificial input, tiny output, and fully reproducible code) are below.
I've put these three files in a folder:
basic_flask_app.py (this is supposed to do very little with the data it gets; the real code is a deep learning model that runs quite fast on GPU, but this example is constructed to make the issue more extreme)
```python
import numpy as np
from flask import Flask, request

from do_request import IS_SMALL_DATA, WIDTH, HEIGHT

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    numpy_bytes = np.frombuffer(request.data, np.float32)
    if IS_SMALL_DATA:
        numpy_image = np.zeros((HEIGHT, WIDTH)) + numpy_bytes
    else:
        numpy_image = numpy_bytes.reshape(HEIGHT, WIDTH)
    result = numpy_image.mean(axis=1).std(axis=0)
    return result.tobytes()

if __name__ == '__main__':
    app.run(host='localhost', port=80, threaded=False, processes=1)
```
[Edited: the original version of this question was missing the parameters `threaded=False, processes=1` in the call to `app.run` above, so the behaviour was not comparable to Gunicorn and Waitress below, which are forced to a single thread/process. I've added them now and re-tested; the results don't change: the Flask server is still fast after this change, if anything faster.]
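As a sanity check of what the handler computes (my own sketch, independent of any server; it reuses the same constants and synthetic input as the client below):

```python
import numpy as np

# Same constants as in do_request.py
WIDTH, HEIGHT = 2500, 3000
n = WIDTH * HEIGHT

# Same synthetic input the client sends in the large-data case
np_image = (np.arange(n).astype(np.float32) / np.float32(n)).reshape(HEIGHT, WIDTH)

# Same computation the handler performs: std of the per-row means
result = np_image.mean(axis=1).std(axis=0)
print(result, len(result.tobytes()))  # a float32 scalar, 4 bytes on the wire
```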
do_request.py
```python
import requests
import numpy as np
from tqdm import trange

WIDTH = 2500
HEIGHT = 3000
IS_SMALL_DATA = False

def main(url='http://127.0.0.1:80/predict'):
    n = WIDTH * HEIGHT
    if IS_SMALL_DATA:
        np_image = np.zeros(1, dtype=np.float32)
    else:
        np_image = np.arange(n).astype(np.float32) / np.float32(n)
    results = []
    for _ in trange(50):
        results.append(requests.post(url, data=np_image.tobytes()))

if __name__ == '__main__':
    main()
```
waitress_server.py
```python
from waitress import serve

import basic_flask_app

serve(basic_flask_app.app, host='127.0.0.1', port=80, threads=1)
```
I've run the tests by running `python do_request.py` after starting the server with one of the following three commands:

```
python basic_flask_app.py
python waitress_server.py
gunicorn -w 1 basic_flask_app:app -b 127.0.0.1:80
```
With these three options, and toggling the `IS_SMALL_DATA` flag (if `True`, only 4 bytes of data are transmitted; if `False`, 30 MB), I got the following timings:
| 50 requests | Flask | Waitress | Gunicorn |
|---|---|---|---|
| 30 MB input, 4 B output | 00:01 (28.6 it/s) | 00:11 (4.42 it/s) | 00:11 (4.26 it/s) |
| 4 B input, 4 B output | 00:01 (25.2 it/s) | 00:02 (23.6 it/s) | 00:01 (26.4 it/s) |
As you can see, the Flask development server is very fast regardless of the amount of data transmitted (the "small" data case is even slightly slower, probably because it wastes time allocating the memory on each of the 50 iterations), while both Waitress and Gunicorn take a significant speed hit as more data is transmitted.
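One way to narrow down where the time goes (a sketch of my own, not part of the original test setup; the hard-coded 3000 × 2500 shape matches the large-data case above) is to time the body read separately from the numpy work inside the handler:

```python
import time

import numpy as np
from flask import Flask, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    t0 = time.perf_counter()
    raw = request.data  # forces the WSGI server to read the request body
    t1 = time.perf_counter()
    numpy_bytes = np.frombuffer(raw, np.float32)
    result = numpy_bytes.reshape(3000, 2500).mean(axis=1).std(axis=0)
    t2 = time.perf_counter()
    print(f'body read: {t1 - t0:.4f}s, numpy work: {t2 - t1:.4f}s')
    return result.tobytes()
```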
At this point, I have a couple of questions:
This is interesting; maybe it explains the question. Using `time.time()` I found that reading `request.data` in the web app takes a very different amount of time depending on the server. With gunicorn it accounts for more than 95% of the request time, about 0.35 s; with the Flask dev server it takes about 0.001 s.
I stepped into the packages and found that most of the time is spent on line 456 of `werkzeug/wrappers/base_request.py`, which is

```python
rv = self.stream.read()
```

With the Flask dev server, `self.stream` is a `werkzeug.wsgi.LimitedStream`, and this line takes about 0.001 s.
With gunicorn, `self.stream` is a `gunicorn.http.body.Body`, and the same line takes more than 0.3 s.
I stepped into `gunicorn/http/body.py`. Lines 214-218 are

```python
while size > self.buf.tell():
    data = self.reader.read(1024)
    if not data:
        break
    self.buf.write(data)
```

and this loop takes more than 0.3 s.
I tried changing the code above to `self.buf.write(self.reader.read(size))`, which brings the cost down to 0.07 s.
I then split that change into separate timed steps:

```python
now = time.time()
buffer = self.reader.read(size)
print(time.time() - now)
now = time.time()
self.buf.write(buffer)
print(time.time() - now)
```

The read takes 0.053 s and the write takes 0.017 s.
I think I've found the reason.
First, gunicorn wraps the raw bytes in its own object built on `io.BytesIO`.
Second, gunicorn reads the bytes in a `while` loop of small chunks, which costs more time.
I guess the purpose of this code is to support high concurrency.
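To see how much the small-chunk loop itself costs at the Python level, here is a standalone micro-benchmark (my own sketch, not gunicorn's exact code; it uses a pure in-memory stream, so it isolates only the loop overhead, not socket latency):

```python
import io
import time

# A 30 MB dummy payload, roughly the size of the large request body above
PAYLOAD = b'x' * (30 * 1024 * 1024)

def read_chunked(stream, chunk_size=1024):
    # Mimics gunicorn's approach: many small reads appended to a BytesIO
    buf = io.BytesIO()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        buf.write(data)
    return buf.getvalue()

def read_once(stream):
    # Single read(), like the fast LimitedStream path
    return stream.read()

for name, fn in [('chunked 1024 B reads', read_chunked), ('single read', read_once)]:
    stream = io.BytesIO(PAYLOAD)
    start = time.perf_counter()
    data = fn(stream)
    elapsed = time.perf_counter() - start
    assert data == PAYLOAD
    print(f'{name}: {elapsed:.4f}s')
```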
In your case, I think you can just use gevent directly:

```python
from gevent.pywsgi import WSGIServer

from basic_flask_app import app

http_server = WSGIServer(('', 80), app)
http_server.serve_forever()
```

This is much faster.