In this Blog, we will learn what is HAProxy and important performance factors.
HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for very high traffic websites and powers quite a number of the world’s most visited ones. Over the years it has become the de-facto standard open source load balancer, is now shipped with most mainstream Linux distributions, and is often deployed by default in cloud platforms
HAProxy involves several techniques commonly found in Operating Systems architectures to achieve the absolute maximal performance:
- A single-process, event-driven model considerably reduces the cost of context switch and the memory usage. Processing several hundreds of tasks in a millisecond is possible, and the memory usage is in the order of a few kilobytes per session while memory consumed in preforked or threaded servers is more in the order of megabytes per process.
- Delayed updates to the event checker using a lazy event cache ensures that we never update an event unless absolutely required. This saves a lot of system calls.
- Single-buffering without any data copy between reads and writes whenever possible. This saves a lot of CPU cycles and useful memory bandwidth. Often, the bottleneck will be the I/O busses between the CPU and the network interfaces. At 10-100 Gbps, the memory bandwidth can become a bottleneck too.
- MRU memory allocator using fixed size memory pools for immediate memory allocation favoring hot cache regions over cold cache ones. This dramatically reduces the time needed to create a new session.
- Work factoring, such as multiple accept () at once, and the ability to limit the number of accepting () per iteration when running in multi-process mode, so that the load is evenly distributed among processes.
- CPU-affinity is supported when running in multi-process mode, or simply to adapt to the hardware and be the closest possible to the CPU core managing the NICs while not conflicting with it.
- Optimized HTTP header analysis: headers are parsed an interpreted on the fly, and the parsing is optimized to avoid a re-reading of any previously read memory area. Checkpointing is used when an end of the buffer is reached with an incomplete header so that the parsing does not start again from the beginning when more data is read. Parsing an average HTTP request typically takes half a microsecond on a fast Xeon.
- Careful reduction of the number of expensive system calls. Most of the work is done in user-space by default, such as time reading, buffer aggregation, file-descriptor enabling/disabling.
- Content analysis is optimized to carry only pointers to original data and never copy unless the data needs to be transformed. This ensures that very small structures are carried over and that contents are never replicated when not absolutely necessary.
All these micro-optimizations result in very low CPU usage even on moderate loads. And even at very high loads, when the CPU is saturated, it is quite common to note figures like 5% user and 95% system, which means that the HAProxy process consumes about 20 times less than its system counterpart. This explains why the tuning of the Operating System is very important. This is the reason why we ended up building our own appliances, in order to save that complex and critical task from the end-user.
In production, HAProxy has been installed several times as an emergency solution when very expensive, high-end hardware load balancers suddenly failed on Layer 7 processing. Some hardware load balancers still do not use proxies and process requests at the packet level and have a great difficulty in supporting requests across multiple packets and high response times because they do no buffer at all.
On the other side, software load balancers use TCP buffering and are insensible to long requests and high response times. A nice side effect of HTTP buffering is that it increases the server’s connection acceptance by reducing the session duration, which leaves room for new requests.
There are 3 important factors used to measure a load balancer’s performance:
The session rate: This factor is very important because it directly determines when the load balancer will not be able to distribute all the requests it receives. It is mostly dependant on the CPU. Sometimes, you will hear about requests/s or hits/s, and they are the same as sessions/s in HTTP/1.0 or HTTP/1.1 with keep-alive disabled.
Requests/s with keep-alive enabled is generally much higher (since it significantly reduces system-side work) but is often meaningless for Internet-facing deployments since clients often open a large number of connections and do not send many requests per connection on average. This factor is measured with varying object sizes, the fastest results generally coming from empty objects (eg: HTTP 302, 304 or 404 response codes). Session rates around 100,000 sessions/s can be achieved on Xeon E5 systems in 2014.
The session concurrency: This factor is tied to the previous one. Generally, the session rate will drop when the number of concurrent sessions increases. The slower the servers, the higher the number of concurrent sessions for the same session rate. If a load balancer receives 10000 sessions per second and the servers respond in 100 ms, then the load balancer will have 1000 concurrent sessions. This number is limited by the amount of memory and the number of file-descriptors the system can handle. With 16 kB buffers, HAProxy will need about 34 kB per session, which results in around 30000 sessions per GB of RAM.
In practice, socket buffers in the system also need some memory and 20000 sessions per GB of RAM is more reasonable. Layer 4 load balancers generally announce millions of simultaneous sessions because they need to deal with the TIME_WAIT sockets that the system handles for free in a proxy. Also, they don’t process any data so they don’t need any buffer. Moreover, they are sometimes designed to be used in Direct Server Return mode, in which the load balancer only sees forward traffic, and which forces it to keep the sessions for a long time after their end to avoid cutting sessions before they are closed.
The data forwarding rate: This factor generally is at the opposite of the session rate. It is measured in Megabytes/s (MB/s), or sometimes in Gigabits/s (Gbps). Highest data rates are achieved with large objects to minimize the overhead caused by session setup and teardown. Large objects generally increase session concurrency, and high session concurrency with high data rate requires large amounts of memory to support large windows.
High data rates burn a lot of CPU and bus cycles on software load balancers because the data has to be copied from the input interface to memory and then back to the output device. Hardware load balancers tend to directly switch packets from input port to output port for higher data rate, but cannot process them and sometimes fail to touch a header or a cookie. Haproxy on a typical Xeon E5 of 2014 can forward data up to about 40 Gbps. A fanless 1.6 GHz Atom CPU is slightly above 1 Gbps.
A load balancer’s performance related to these factors is generally announced for the best case (eg: empty objects for session rate, large objects for data rate). This is not because of lack of honesty from the vendors, but because it is not possible to tell exactly how it will behave in every combination. So when those 3 limits are known, the customer should be aware that it will generally perform below all of them. A good rule of thumb on software load balancers is to consider an average practical performance of half of maximal session and data rates for average sized objects.
Being obsessed with reliability, I tried to do my best to ensure a total continuity of service by design. It’s more difficult to design something reliable from the ground up in the short term, but in the long term, it reveals easier to maintain than broken code which tries to hide its own bugs behind respawning processes and tricks like this.
In single-process programs, you have no right to fail: the smallest bug will either crash your program, make it spin like mad or freeze. There has not been any such bug found in stable versions for the last 13 years, though it happened a few times with development code running in production.
HAProxy has been installed on Linux 2.4 systems serving millions of pages every day, and which have only known one reboot in 3 years for a complete OS upgrade. Obviously, they were not directly exposed to the Internet because they did not receive any patch at all. The kernel was a heavily patched 2.4 with Robert Love’s jiffies64 patches to support time wrap-around at 497 days (which happened twice). On such systems, the software cannot fail without being immediately noticed!
Right now, it’s being used in many Fortune 500 companies around the world to reliably serve billions of pages per day or relay huge amounts of money. Some people even trust it so much that they use it as the default solution to solve simple problems (and I often tell them that they do it the dirty way). Such people sometimes still use versions 1.1 or 1.2 which sees very limited evolutions and which targets mission-critical usages.
HAProxy is really suited for such environments because the indicators it returns provide a lot of valuable information about the application’s health, behavior and defects, which are used to make it even more reliable. Version 1.3 has now received far more testing than 1.1 and 1.2 combined, so users are strongly encouraged to migrate to a stable 1.3 or 1.4 for mission-critical usages.
As previously explained, most of the work is executed by the Operating System. For this reason, a large part of the reliability involves the OS itself. Latest versions of Linux 2.4 have been known for offering the highest level of stability ever. However, it requires a bunch of patches to achieve a high level of performance, and this kernel is really outdated now so running it on recent hardware will often be difficult (though some people still do).
Linux 2.6 and 3.x include the features needed to achieve this level of performance, but old LTS versions only should be considered for really stable operations without upgrading more than once a year. Some people prefer to run it on Solaris (or do not have the choice). Solaris 8 and 9 are known to be really stable right now, offering a level of performance comparable to legacy Linux 2.4 (without the epoll patch). Solaris 10 might show performances closer to early Linux 2.6.
FreeBSD shows good performance but pf (the firewall) eats half of it and needs to be disabled to come close to Linux. OpenBSD sometimes shows socket allocation failures due to sockets staying in FIN_WAIT2 state when client suddenly disappears. Also, I’ve noticed that hot reconfiguration does not work under OpenBSD.
The reliability can significantly decrease when the system is pushed to its limits. This is why finely tuning the sysctls is important. There is no general rule, every system and every application will be specific. However, it is important to ensure that the system will never run out of memory and that it will never swap. A correctly tuned system must be able to run for years at full load without slowing down or crashing.
Security is an important concern when deploying a software load balancer. It is possible to harden the OS, to limit the number of open ports and accessible services, but the load balancer itself stays exposed. For this reason, I have been very careful about programming style. Vulnerabilities are very rarely encountered on haproxy, and its architecture significantly limits their impact and often allows easy workarounds. Its remotely unpredictable even processing makes it very hard to reliably exploit any bug, and if the process ever crashes, the bug is discovered. All of them were discovered by reverse-analysis of an accidental crash BTW.
Anyway, much care is taken when writing code to manipulate headers. Impossible state combinations are checked and returned, and errors are processed from the creation to the death of a session. A few people around the world have reviewed the code and suggested clean-ups for better clarity to ease auditing. By the way, I’m used to refuse patches that introduce suspect processing or in which not enough care is taken for abnormal conditions.
I generally suggest starting HAProxy as root because it can then jail itself in a chroot and drop all of its privileges before starting the instances. This is not possible if it is not started as root because only root can execute chroot(), contrary to what some admins believe.
Logs provide a lot of information to help maintain a satisfying security level. They are commonly sent over UDP because once chrooted, the /dev/log UNIX socket is unreachable, and it must not be possible to write to a file.
The following information is particularly useful:
• source IP and port of requestor make it possible to find their origin in firewall logs ;
• session set up date generally matches firewall logs, while tear down date often matches proxies dates ;
• Proper request encoding ensures the requestor cannot hide non-printable characters, nor fool a terminal.
• Arbitrary request and response header and cookie capture help to detect scan attacks, proxies and infected hosts.
• Timers help to differentiate hand-typed requests from browsers.
HAProxy also provides regex-based header control. Parts of the request, as well as request and response headers can be denied, allowed, removed, rewritten, or added. This is commonly used to block dangerous requests or encodings (eg: the Apache Chunk exploit), and to prevent accidental information leak from the server to the client. Other features such as Cache-control checking ensure that no sensible information gets accidently cached by an upstream proxy consecutively to a bug in the application server for example.