python flask gunicorn wsgi waitress

Waitress and GUnicorn large data input is much slower than Flask development server


Problem description

I am trying to create a Flask app that should:

I made a quick test and ran it with the Flask development server, and it worked as expected. Put off by the red warning "WARNING: This is a development server. Do not use it in a production deployment.", I tried putting the app behind a WSGI server, but both Waitress and GUnicorn were much slower. The tests (a toy problem with artificial input, tiny output, and fully reproducible code) are below.

Code to run the tests

I've put these three files in a folder:

basic_flask_app.py (this is meant to do very little with the data it receives; my real code is a deep learning model that runs quite fast on GPU, but this example is designed to make the issue more extreme)

import numpy as np

from flask import Flask, request
from do_request import IS_SMALL_DATA, WIDTH, HEIGHT

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    numpy_bytes = np.frombuffer(request.data, np.float32)
    if IS_SMALL_DATA:
        numpy_image = np.zeros((HEIGHT, WIDTH)) + numpy_bytes
    else:
        numpy_image = numpy_bytes.reshape(HEIGHT, WIDTH)
    result = numpy_image.mean(axis=1).std(axis=0)
    return result.tobytes()


if __name__ == '__main__':
    app.run(host='localhost', port=80, threaded=False, processes=1)

[Edited: the original version of this question was missing the parameters threaded=False, processes=1 in the call to app.run above, so the behaviour was not the same as with GUnicorn and Waitress below, which are forced to a single thread/process. I have now added them and re-tested: the results don't change, and the Flask server is still fast after this change (if anything, faster).]

do_request.py

import requests
import numpy as np
from tqdm import trange

WIDTH = 2500
HEIGHT = 3000
IS_SMALL_DATA = False


def main(url='http://127.0.0.1:80/predict'):
    n = WIDTH * HEIGHT
    if IS_SMALL_DATA:
        np_image = np.zeros(1, dtype=np.float32)
    else:
        np_image = np.arange(n).astype(np.float32) / np.float32(n)
    results = []
    for _ in trange(50):
        results.append(requests.post(url, data=np_image.tobytes()))


if __name__ == '__main__':
    main()

waitress_server.py

from waitress import serve
import basic_flask_app
serve(basic_flask_app.app, host='127.0.0.1', port=80, threads=1)

Test results

I ran the tests by running python do_request.py after starting the app with one of the following three commands:

python basic_flask_app.py
python waitress_server.py 
gunicorn -w 1 basic_flask_app:app -b 127.0.0.1:80

With these three options, and toggling the IS_SMALL_DATA flag (if True, only 4 bytes of data are transmitted; if False, 30MB) I got the following timings:

50 requests              Flask               Waitress             GUnicorn
30MB input, 4B output:   00:01 (28.6 it/s)   00:11 (4.42 it/s)    00:11 (4.26 it/s)
4B input, 4B output:     00:01 (25.2 it/s)   00:02 (23.6 it/s)    00:01 (26.4 it/s)

As you can see, the Flask development server is very fast regardless of the amount of data transmitted (the "small" case is even a bit slower, probably because it spends time allocating memory on each of the 50 iterations), while both Waitress and GUnicorn take a significant speed hit when more data is transmitted.
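As a quick sanity check on the figures above, the payload size and the approximate upload throughput can be derived from the constants in do_request.py and the it/s values in the table. This is only back-of-the-envelope arithmetic: it ignores response time and per-request overhead, and the helper names are made up for illustration.

```python
# Constants copied from do_request.py
WIDTH = 2500
HEIGHT = 3000
BYTES_PER_FLOAT32 = 4  # np.float32 occupies 4 bytes per element

# Request rates (it/s) copied from the "30MB input" row of the table
RATES = {"Flask": 28.6, "Waitress": 4.42, "GUnicorn": 4.26}


def payload_bytes(width=WIDTH, height=HEIGHT, itemsize=BYTES_PER_FLOAT32):
    """Size in bytes of one request body for a width x height float32 array."""
    return width * height * itemsize


def throughput_mb_per_s(rate_per_s, n_bytes):
    """Approximate upload throughput in MB/s (1 MB = 10**6 bytes)."""
    return rate_per_s * n_bytes / 1e6


if __name__ == "__main__":
    size = payload_bytes()
    print(f"payload per request: {size} bytes")
    for server, rate in RATES.items():
        print(f"{server}: ~{throughput_mb_per_s(rate, size):.0f} MB/s")
```

The request body is 30,000,000 bytes, so the gap between the servers in the table is a gap in how fast each one hands the body to the application, not in the computation itself.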

Questions

At this point, I have a couple of questions:


Solution

  • This is interesting. Maybe the following explains it.

    1. Using time.time(), I found that accessing request.data in the web app takes very different amounts of time under the two servers. Under gunicorn it takes about 0.35s, which is more than 95% of the request time. Under the Flask development server it takes about 0.001s.

    2. I stepped into the package and found that most of the time is spent at line 456 of werkzeug/wrappers/base_request.py, which is

      rv = self.stream.read()

      With the Flask dev server, self.stream is a werkzeug.wsgi.LimitedStream, and this line takes about 0.001s.

      With gunicorn, self.stream is a gunicorn.http.body.Body, and this line takes more than 0.3s.

    3. I stepped into gunicorn/http/body.py. Lines 214-218 are:

       while size > self.buf.tell():
           data = self.reader.read(1024)
           if not data:
               break
           self.buf.write(data)
      

      This loop takes more than 0.3s.

    4. I tried replacing the loop above with self.buf.write(self.reader.read(size)). This brings the time down to 0.07s.

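    5. I split that statement in two and timed each part:

       now = time.time()
       buffer = self.reader.read(size)
       print(time.time() - now)
       now = time.time()
       self.buf.write(buffer)
       print(time.time() - now)
      

      The first part (the read) takes 0.053s and the second part (the write) takes 0.017s.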
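The difference between the loop in step 3 and the single call in step 4 can be tried in isolation with a rough sketch like the one below. It is not Gunicorn's code: an in-memory io.BytesIO stands in for the socket reader, the function names are hypothetical, and the timings on a real connection will differ. It only compares many 1024-byte reads against one large read of the same payload.

```python
import io
import time

CHUNK = 1024  # chunk size used in the loop quoted in step 3


def read_in_chunks(reader, size, chunk=CHUNK):
    """Copy up to `size` bytes into a BytesIO, `chunk` bytes per read."""
    buf = io.BytesIO()
    while size > buf.tell():
        data = reader.read(chunk)
        if not data:  # source exhausted before `size` bytes arrived
            break
        buf.write(data)
    return buf.getvalue()


def read_at_once(reader, size):
    """Copy up to `size` bytes into a BytesIO with a single read call."""
    buf = io.BytesIO()
    buf.write(reader.read(size))
    return buf.getvalue()


def timed(func, payload):
    """Run `func` on an in-memory reader; return (result, seconds elapsed)."""
    reader = io.BytesIO(payload)  # assumption: stands in for the socket reader
    start = time.perf_counter()
    result = func(reader, len(payload))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    payload = bytes(30_000_000)  # same size as the 30MB request body
    for func in (read_in_chunks, read_at_once):
        result, elapsed = timed(func, payload)
        assert result == payload
        print(f"{func.__name__}: {elapsed:.3f}s")
```

Both functions return the same bytes; only the number of read and write calls differs, which is the cost that steps 3 to 5 isolate.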

    I think this explains it.

    First, gunicorn wraps the raw bytes in its own object, backed by io.BytesIO.

    Second, gunicorn reads the bytes in a while loop, 1024 bytes at a time, which takes more time.

    I guess the purpose of this code is to support high concurrency.

    In your case, I think you can just use gevent directly.

    from gevent.pywsgi import WSGIServer
    from basic_flask_app import app
    
    http_server = WSGIServer(('', 80), app)
    http_server.serve_forever()
    

    This is much faster.