How does HAProxy avoid request-time overhead when doing load balancing?
I tested HAProxy and, for fun, compared it to a simple port forwarder written in Twisted (Python). In my preliminary tests, making an HTTP request through an HAProxy load balancer adds no overhead [1] in request time compared to making the HTTP request directly to the backend server, whereas my own Python script adds roughly 3x overhead in response time.
Now, my script is written in Python and HAProxy in C, so a priori HAProxy has the advantage of avoiding the call overhead (of going from Python code to syscalls) that the Python code experiences. But can that account for the big discrepancy in performance, or does HAProxy use some OS tricks to improve performance even further? I tried profiling my Python code, but it didn't reveal any hotspots, so my guess is that it spends most of its time in syscalls that are not accounted for in the profiling.
[1]: As reported by ab, with 100 concurrent connections and 10,000 total requests. The mean time for HAProxy is 37 ms and for my Python script it is 128 ms.
The setup is a TCP load balancer with two backend Node.js servers, just serving static text. I deliberately wanted to test TCP load balancing, so the test protocol happened to be HTTP. All three machines are virtual hosts from DigitalOcean, single-threaded, 512 MB RAM, 1 core. The Python script can be seen here and my haproxy.cfg can be found here.
It turns out that the HAProxy website already covers this area (my mistake for overlooking it). The answer is basically a lot of low-level optimizations. Quoted directly from the HAProxy website:
HAProxy involves several techniques commonly found in Operating Systems architectures to achieve the absolute maximal performance:
- a single-process, event-driven model considerably reduces the cost of context switch and the memory usage. Processing several hundreds of tasks in a millisecond is possible, and the memory usage is in the order of a few kilobytes per session while memory consumed in Apache-like models is more in the order of megabytes per process.
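To make the first point concrete, here is a minimal sketch of a single-process, event-driven loop in Python using the stdlib `selectors` module. It is analogous in spirit (certainly not in performance) to HAProxy's model; all names here are illustrative, not from HAProxy.

```python
import selectors
import socket

def run_event_loop(sel, max_iterations=100):
    """One process dispatches ready callbacks until nothing is registered."""
    for _ in range(max_iterations):
        if not sel.get_map():
            break
        for key, _mask in sel.select(timeout=0.1):
            key.data(key.fileobj)  # key.data holds the per-socket callback

def echo_once(sel, sock):
    """Read one chunk, echo it back, then unregister and close."""
    data = sock.recv(4096)
    if data:
        sock.sendall(data)
    sel.unregister(sock)
    sock.close()

# Usage: a connected socket pair stands in for a client/server connection.
sel = selectors.DefaultSelector()
client, server = socket.socketpair()
sel.register(server, selectors.EVENT_READ, data=lambda s: echo_once(sel, s))
client.sendall(b"ping")
run_event_loop(sel)
reply = client.recv(4096)  # b"ping"
```

No threads or processes are spawned: every connection is just a few bytes of state plus a callback, which is why per-session memory stays small.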
- O(1) event checker on systems that allow it (Linux and FreeBSD) allowing instantaneous detection of any event on any connection among tens of thousands.
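The O(1) checker on Linux is epoll. As a hedged illustration (Linux-only), the key difference from classic `select()` is that file descriptors are registered once and the kernel returns only the ready ones, so cost does not grow with the total number of watched connections:

```python
import select
import socket

client, server = socket.socketpair()

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)  # register once, up front

client.sendall(b"x")
events = ep.poll(timeout=1)  # returns only the ready fds, not a full scan

ready_fd, mask = events[0]
assert ready_fd == server.fileno()
assert mask & select.EPOLLIN
```

With `select()`, every call rescans the whole fd set (O(n)); with epoll, a wakeup for one ready connection among tens of thousands costs the same as among ten.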
- Single-buffering without any data copy between reads and writes whenever possible. This saves a lot of CPU cycles and useful memory bandwidth. Often, the bottleneck will be the I/O busses between the CPU and the network interfaces. At 10 Gbps, the memory bandwidth can become a bottleneck too.
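An illustrative sketch of the single-buffer idea in Python: reuse one preallocated buffer for the whole read/write path instead of allocating a fresh bytes object per read, and write out through a `memoryview` slice so no extra user-space copy is made.

```python
import socket

BUF = bytearray(65536)      # one reusable buffer for the forwarding path
VIEW = memoryview(BUF)

def forward_chunk(src, dst):
    """Read into the shared buffer and write the same bytes out, copy-free."""
    n = src.recv_into(BUF)          # fills BUF in place, no new allocation
    if n:
        dst.sendall(VIEW[:n])       # memoryview slice: no data copy
    return n

# Usage: two socket pairs stand in for client-side and server-side links.
a, b = socket.socketpair()
c, d = socket.socketpair()
a.sendall(b"hello")
forward_chunk(b, c)
out = d.recv(100)  # b"hello"
```

The payload crosses user space exactly once per direction; a naive `data = src.recv(...)` / `dst.sendall(data)` loop allocates and copies on every iteration.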
- Zero-copy forwarding is possible using the splice() system call under Linux, and results in real zero-copy starting with Linux 3.5. This allows a small sub-3 Watt device such as a Seagate Dockstar to forward HTTP traffic at one gigabit/s.
- MRU memory allocator using fixed size memory pools for immediate memory allocation favoring hot cache regions over cold cache ones. This dramatically reduces the time needed to create a new session.
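A toy sketch of the MRU pool idea (class and method names are mine, not HAProxy's): a LIFO free list hands back the most recently released buffer first, which is the one most likely to still be in CPU cache.

```python
class BufferPool:
    """Fixed-size pool with MRU (last-freed-first-reused) behavior."""

    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]  # LIFO free list

    def acquire(self):
        # pop() returns the most recently released buffer -- the "hot" one
        return self._free.pop() if self._free else None

    def release(self, buf):
        self._free.append(buf)

pool = BufferPool(count=2, size=4096)
a = pool.acquire()
pool.release(a)
b = pool.acquire()
reused = a is b  # True: the just-freed buffer is handed out again first
```

Allocation is a pointer pop with no malloc call, and the returned memory is cache-warm, which is why session creation gets so cheap.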
- work factoring, such as multiple accept() at once, and the ability to limit the number of accept() per iteration when running in multi-process mode, so that the load is evenly distributed among processes.
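The multiple-accept idea can be sketched as draining the listen queue in a batch per event-loop wakeup, up to a cap (the cap value below is my assumption, not HAProxy's):

```python
import socket

def accept_batch(listener, limit=8):
    """Accept up to `limit` queued connections; stop when the queue is empty."""
    conns = []
    for _ in range(limit):
        try:
            conn, _addr = listener.accept()
            conns.append(conn)
        except BlockingIOError:       # EAGAIN: listen queue drained
            break
    return conns

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
listener.setblocking(False)

# Queue three clients, then drain them in a single batch.
clients = [socket.create_connection(listener.getsockname()) for _ in range(3)]
accepted = accept_batch(listener)
```

One wakeup handles several connections (fewer epoll round-trips), while the cap stops any single process in a multi-process setup from monopolizing the queue.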
- tree-based storage, making heavy use of the Elastic Binary tree I have been developing for several years. This is used to keep timers ordered, to keep the runqueue ordered, to manage round-robin and least-conn queues, with only an O(log(N)) cost.
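HAProxy's actual structure is its Elastic Binary tree; as a simplified stand-in, a binary heap also keeps timers ordered with O(log N) insert and pop, so the next expiry is always available without scanning all timers:

```python
import heapq

# Each entry is (deadline_in_seconds, label); labels are illustrative.
timers = []
heapq.heappush(timers, (30.0, "health-check"))      # O(log N)
heapq.heappush(timers, (5.0, "client-timeout"))
heapq.heappush(timers, (12.5, "connect-timeout"))

deadline, name = heapq.heappop(timers)  # earliest deadline first, O(log N)
# deadline == 5.0, name == "client-timeout"
```

With tens of thousands of sessions each carrying timeouts, the difference between O(log N) ordered storage and a linear scan per tick is decisive.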
- optimized HTTP header analysis: headers are parsed and interpreted on the fly, and the parsing is optimized to avoid re-reading of any previously read memory area. Checkpointing is used when an end of buffer is reached with an incomplete header, so that the parsing does not start again from the beginning when more data is read. Parsing an average HTTP request typically takes 2 microseconds on a Pentium-M 1.7 GHz.
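A hedged sketch of the checkpointing idea (the class and its fields are my invention, not HAProxy's parser): the parser remembers how far it has already scanned, so when a header arrives split across two reads it resumes from the checkpoint instead of rescanning from the start.

```python
class HeaderParser:
    """Incremental header-line parser with a scan checkpoint."""

    def __init__(self):
        self.buf = b""
        self.scanned = 0   # checkpoint: bytes already examined for "\r\n"

    def feed(self, chunk):
        """Return any complete header lines; buffer a partial tail."""
        self.buf += chunk
        lines = []
        while True:
            nl = self.buf.find(b"\r\n", self.scanned)  # resume at checkpoint
            if nl == -1:
                # Back up one byte in case the buffer ends mid-"\r\n".
                self.scanned = max(len(self.buf) - 1, 0)
                return lines
            lines.append(self.buf[:nl])
            self.buf = self.buf[nl + 2:]
            self.scanned = 0

p = HeaderParser()
first = p.feed(b"Host: exa")        # [] -- incomplete, position checkpointed
second = p.feed(b"mple.com\r\n")    # [b'Host: example.com']
```

The second `feed` scans only the new bytes (plus one), never re-reading the prefix; that is the property that keeps per-request parsing in the microsecond range.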
- careful reduction of the number of expensive system calls. Most of the work is done in user-space by default, such as time reading, buffer aggregation, file-descriptor enabling/disabling.
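The "time reading in user-space" point can be sketched as caching the clock once per event-loop iteration and letting every session read the cached value, instead of issuing a clock syscall per timestamp (the class below is illustrative, not HAProxy code):

```python
import time

class CachedClock:
    """Read the clock once per loop iteration; everyone else reads the cache."""

    def __init__(self):
        self.now = time.monotonic()

    def tick(self):
        # One clock read per event-loop iteration refreshes the cache.
        self.now = time.monotonic()

clock = CachedClock()
clock.tick()
# Hundreds of sessions can now timestamp themselves with zero extra syscalls:
t1 = clock.now
t2 = clock.now
same = t1 == t2  # True: same cached value, no further clock reads
```

Within one iteration the timestamps are identical by construction, which is accurate enough for timeouts and logging while eliminating thousands of syscalls per second.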