uWSGI Python high-load configuration

We have a big EC2 instance with 32 cores, currently running Nginx, Tornado and Redis, serving on average 5K requests per second. Everything seems to work fine, but the CPU load is already reaching 70% and we have to support even more requests. One idea was to replace Tornado with uWSGI, because we don't really use the async features of Tornado.

Our application consists of a single function: it receives JSON (about 4 KB), does some blocking but very fast work (Redis) and returns JSON.

Proxy HTTP request to one of the Tornado instances (Nginx)
Parse HTTP request (Tornado)
Read POST body string (stringified JSON) and convert it to python dictionary (Tornado)
Take data out of Redis (blocking) located on the same machine (redis-py with hiredis)
Process the data (python3.4)
Update Redis on the same machine (redis-py with hiredis)
Prepare stringified JSON for response (python3.4)
Send response to proxy (Tornado)
Send response to client (Nginx)
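
For illustration only, the application-side steps above could be sketched as a plain WSGI callable. This is not our real code: the key scheme, the `delta` field and the `make_app` helper are hypothetical, and `store` is assumed to be any object with `get` and `set` methods, such as a `redis.Redis` client.

```python
import json


def make_app(store):
    """Build a WSGI callable around `store` (any object with get/set)."""

    def app(environ, start_response):
        # Read the POST body and convert it to a Python dictionary.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))

        # Blocking read from the store (Redis on the same machine).
        key = "counter:" + str(payload["id"])  # hypothetical key scheme
        current = int(store.get(key) or 0)

        # Process the data, then write the result back.
        updated = current + int(payload.get("delta", 1))
        store.set(key, updated)

        # Prepare the stringified JSON response.
        body = json.dumps({"id": payload["id"], "value": updated}).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app
```

In `app.py` this would be wired up with something like `application = make_app(redis.Redis(host="localhost", port=6379))`, since uWSGI looks for a callable named `application` by default.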

We thought the speed improvement would come from the uwsgi protocol: we could install Nginx on a separate server and proxy all requests to uWSGI over the uwsgi protocol. But after trying every configuration we could think of and changing OS parameters, we still can't get it to work even at the current load. Most of the time the nginx log contains 499 and 502 errors. In some configurations it simply stopped accepting new requests, as if it had hit some OS limit.

So, as I said, we have 32 cores, 60 GB of free memory and a very fast network. We don't do heavy stuff, only very fast blocking operations. What is the best strategy in this case: processes, threads or async? What OS parameters should be set?

Current configuration is:

[uwsgi]
master = 2
processes = 100
socket = /tmp/uwsgi.sock
wsgi-file = app.py
daemonize = /dev/null
pidfile = /tmp/uwsgi.pid
listen = 64000
stats = /tmp/stats.socket
cpu-affinity = 1
max-fd = 20000
memory-report = 1
gevent = 1000
thunder-lock = 1
threads = 100
post-buffering = 1

Nginx config:

user www-data;
worker_processes 10;
pid /run/nginx.pid;

events {
    worker_connections 1024;
    multi_accept on;
    use epoll;
}

OS config:

sysctl net.core.somaxconn
net.core.somaxconn = 64000

I know the limits are too high; I started trying every value possible.

UPDATE:

I ended up with the following configuration:

[uwsgi]
chdir = %d
master = 1
processes = %k
socket = /tmp/%c.sock
wsgi-file = app.py
lazy-apps = 1
touch-chain-reload = %dreload
virtualenv = %d.env
daemonize = /dev/null
pidfile = /tmp/%c.pid
listen = 40000
stats = /tmp/stats-%c.socket
cpu-affinity = 1
max-fd = 200000
memory-report = 1
post-buffering = 1
threads = 2

Solution

I think your request handling roughly breaks down as follows:

HTTP parsing, request routing, JSON parsing
execute some python code which yields a redis request
(blocking) redis request
execute some python code which processes the redis response
JSON serialization, HTTP response serialization

You could benchmark the handling time on a near-idle system. My hunch is that the round trip would boil down to 2 or 3 milliseconds. At 70% CPU load this would go up to about 4 or 5 ms (not counting time spent in nginx request queue, just the handling in uWSGI worker).

At 5k req/s, your average number of in-process requests would be in the 20 to 25 range (request rate multiplied by handling time). A decent match for your VM.
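
The estimate is just Little's law (mean concurrency equals arrival rate times mean time in the system). A quick sketch of the arithmetic, assuming the 4 to 5 ms handling time guessed above; the function name is only for illustration:

```python
def in_flight_requests(requests_per_second, handling_time_ms):
    """Little's law: mean concurrency = arrival rate x mean time in the system."""
    return requests_per_second * handling_time_ms / 1000.0


for ms in (4, 5):
    print(ms, "ms ->", in_flight_requests(5000, ms), "requests in flight")
```

Plugging in your own measured handling time instead of my guess tells you how many requests are really being worked on at once.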

The next step is to balance the CPU cores. If you have 32 cores, it does not make sense to allocate 1000 worker processes. You might end up choking the system on context-switching overhead. A good balance will have the total number of workers (nginx + uWSGI + redis) in the same order of magnitude as the available CPU cores, maybe with a little extra to cover for blocking I/O (i.e. filesystem, but mainly networked requests to other hosts such as a DBMS). If blocking I/O becomes a big part of the equation, consider rewriting into asynchronous code and integrating an async stack.

First observation: you're allocating 10 workers to nginx. However, the CPU time nginx spends on a request is much lower than the time uWSGI spends on it. I would start by dedicating about 10% of the system to nginx (3 or 4 worker processes).

The remainder would have to be split between uWSGI and redis. I don't know about the size of your indices in redis, or about the complexity of your python code, but my first attempt would be a 75%/25% split between uWSGI and redis. That would put redis on about 6 workers and uWSGI on about 20 workers + a master.
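
As a rough sketch of that budgeting, here is the same split as a small helper. The function name and the rounding choices are my own assumptions, so it lands near the figures above rather than exactly on them:

```python
def split_cores(total_cores, nginx_share=0.10, redis_share=0.25):
    """Return a rough (nginx, uwsgi, redis) worker split for one machine."""
    # About 10% of the cores go to nginx, but never fewer than one worker.
    nginx = max(1, round(total_cores * nginx_share))
    remaining = total_cores - nginx
    # Redis gets its share of what is left; uWSGI takes the rest.
    redis = max(1, round(remaining * redis_share))
    uwsgi = remaining - redis
    return nginx, uwsgi, redis


print(split_cores(32))
```

Treat the result as a starting point for load testing, not as a final answer; the right shares depend on how much time each request actually spends in each component.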

As for the threads option in the uWSGI configuration: thread switching is lighter than process switching, but if a significant part of your Python code is CPU-bound it won't fly because of the GIL. The threads option is mainly interesting if a significant part of your handling time is blocked on I/O. You could disable threads, or try workers=10, threads=2 as an initial attempt.
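
If you want to see the difference for yourself, a small experiment along these lines shows it. The two task functions are stand-ins I made up (a sleep for blocking I/O, a pure-Python loop for CPU-bound work), and the exact timings will vary by machine:

```python
import threading
import time


def run_concurrently(task, thread_count):
    """Run `task` in `thread_count` threads and return the elapsed wall time."""
    threads = [threading.Thread(target=task) for _ in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def io_bound():
    # Stands in for a blocking network call; sleeping releases the GIL.
    time.sleep(0.05)


def cpu_bound():
    # Pure Python work; only one thread at a time can execute it.
    sum(i * i for i in range(200000))


if __name__ == "__main__":
    for task in (io_bound, cpu_bound):
        one = run_concurrently(task, 1)
        four = run_concurrently(task, 4)
        print(task.__name__, "1 thread: %.3fs" % one, "4 threads: %.3fs" % four)
```

On a standard CPython build, four I/O-bound threads should finish in roughly the time of one, while four CPU-bound threads should take about four times as long. That is why threads only pay off when requests spend most of their time waiting.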