System Reliability - One Crash per Thousand Server Years

Servers crash, except when they don’t. Our servers don’t crash, at least not often: our current rate is about one crash per thousand server years. That means for every thousand servers we run, we see roughly one crash per year, and that includes all our Xen-based private clouds and their VMs. This is darn low by anyone’s standards.
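
To make the rate concrete, here is a back-of-the-envelope sketch of the arithmetic (the fleet sizes are hypothetical examples, not our actual counts):

```python
# Back-of-the-envelope sketch of the crash rate quoted above.
# Fleet sizes below are hypothetical examples, not our actual counts.
CRASHES_PER_SERVER_YEAR = 1 / 1000  # one crash per thousand server years

def expected_crashes(fleet_size: int, years: float = 1.0) -> float:
    """Expected crash count for a fleet over a period, at a constant rate."""
    return fleet_size * years * CRASHES_PER_SERVER_YEAR

for fleet in (100, 1_000, 10_000):
    print(f"{fleet:>6} servers -> {expected_crashes(fleet):.2f} crashes/year")

# Equivalently, mean time between crashes for a single server:
print(f"MTBC per server: {1 / CRASHES_PER_SERVER_YEAR:.0f} years")
```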

Why is this so good?

Well, it’s due to a number of factors, so let me walk through them.

First, 95% of our systems use CentOS, which is of course the free, open-source rebuild of Red Hat Enterprise Linux (RHEL). The entire purpose of RHEL is stability, so it has strict rules about what can change over the roughly ten-year lifetime of a major version (currently RHEL 7). That means when a new version debuts it is essentially frozen in time; for RHEL 7, that time was early 2014. Nothing new will be added except bug and security fixes for the decade or more the version remains supported.
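
Checking which frozen major version a given box is on is trivial; here is a minimal sketch, assuming a RHEL/CentOS host with the standard release file:

```python
# Minimal sketch, assuming a RHEL/CentOS host: read the standard release
# file to see which frozen major version the box is on.
import re
from pathlib import Path
from typing import Optional

def rhel_major_version(release_file: str = "/etc/redhat-release") -> Optional[int]:
    # Typical contents: "CentOS Linux release 7.9.2009 (Core)"
    text = Path(release_file).read_text()
    match = re.search(r"release (\d+)", text)
    return int(match.group(1)) if match else None

print(rhel_major_version())  # -> 7 on a CentOS 7 / RHEL 7 host
```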

Now, a lot of people complain this is too old, and that, for example, the other current release, RHEL 6, locks you into something from 2010. This is partly true, but for almost everything it is irrelevant. You rarely need the new features (Docker being the notable exception), and we always install the latest versions of the services themselves anyway. With an installation like this you are really only locked into a stable kernel version and its related plumbing such as drivers, LVM/DM, iptables, etc., plus some bundled core components such as openssl and bash.
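
If you want to see exactly which components you are pinned to, you can query the RPM database directly. A quick sketch, assuming an RPM-based system; the package list is illustrative:

```python
# Sketch: query the RPM database for the components a frozen major
# version actually pins you to. Assumes the rpm CLI is available.
import subprocess

PINNED = ["kernel", "openssl", "bash", "iptables", "lvm2"]  # illustrative list

for pkg in PINNED:
    result = subprocess.run(["rpm", "-q", pkg], capture_output=True, text=True)
    print(result.stdout.strip() or f"{pkg}: not installed")
```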

People often want us to use the latest Ubuntu/Debian instead of CentOS, which we generally resist (see our other posts on this). In most cases they are less stable and cause more problems for us, not least the lack of various packages as debs. But the real issue is reliability and stability: we’ve had far, far more problems with Ubuntu than with RHEL/CentOS. It is as simple as that.

We also use the best hardware we can, which helps a lot. Our standard is Dell rack servers, especially the R420, which has worked well for us with few problems. A mainstream server like this is well supported, well tested, and well understood, all of which contributes to overall stability. The same logic applies to choosing good clouds, where crashes are rare.

On top of the hardware, we install only the basics, with no unneeded services or junk. Doing so shrinks the security attack surface and also improves reliability: fewer moving parts, libraries, services, and other things floating around means fewer things that can break.
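
A simple way to stay honest about this is to audit what actually starts at boot. Here is a minimal sketch, assuming a systemd-based host such as CentOS 7:

```python
# Sketch of a minimal-install audit: list every service systemd will
# start at boot, so anything unrecognized can be questioned and removed.
import subprocess

out = subprocess.run(
    ["systemctl", "list-unit-files", "--type=service",
     "--state=enabled", "--no-legend", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

enabled = sorted(line.split()[0] for line in out.splitlines() if line.strip())
print(f"{len(enabled)} services enabled at boot:")
for unit in enabled:
    print(f"  {unit}")
```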

In and above the OS we also apply world-class configurations for the kernel, firewall, drivers, and more, plus all the services such as MySQL, HAProxy, etc. We also monitor all of this closely to ensure the OOM killer stays in its cage, swap use stays minimal (but swap is always available), NUMA memory is interleaved, and more, all in the name of stability.
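
As a rough sketch of a few of those checks (standard Linux /proc and /sys paths; the output and the always-have-swap assertion mirror the policy above, not our production tooling):

```python
# Sketch of the stability checks above: swap present but barely used,
# swappiness visible, NUMA nodes enumerated. Standard Linux paths.
from pathlib import Path

def meminfo_kb(key: str) -> int:
    for line in Path("/proc/meminfo").read_text().splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])  # value is in kB
    raise KeyError(key)

swap_total = meminfo_kb("SwapTotal")
swap_free = meminfo_kb("SwapFree")
swappiness = int(Path("/proc/sys/vm/swappiness").read_text())

assert swap_total > 0, "swap should always be available"
used_pct = 100 * (swap_total - swap_free) / swap_total
print(f"swap used: {used_pct:.1f}% (want ~0), swappiness: {swappiness}")

# NUMA nodes present; interleaving itself is typically applied per-service,
# e.g. via `numactl --interleave=all` in the service's startup.
nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))
print(f"NUMA nodes: {nodes}")
```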

I should also mention that our installed services rarely crash: almost never for core services such as Nginx, Apache, and Redis. We do see some issues in older PHP-FPM versions, and MySQL will crash maybe once per hundred server years. This reliability is due to all of the above, plus using vendor-supplied RPMs (almost never building from source), the latest versions, and best-practice configurations.
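
One cheap crash indicator is the restart count systemd keeps per service. A sketch (NRestarts needs a reasonably recent systemd, and the unit names below are examples that vary by distro and packaging):

```python
# Sketch: pull state and restart counts for core services as a cheap
# crash indicator. NRestarts requires a reasonably recent systemd;
# unit names are examples and vary by distro/packaging.
import subprocess

SERVICES = ["nginx", "httpd", "redis", "php-fpm", "mysqld"]

for svc in SERVICES:
    result = subprocess.run(
        ["systemctl", "show", svc, "--property=ActiveState,NRestarts"],
        capture_output=True, text=True,
    )
    props = dict(line.split("=", 1)
                 for line in result.stdout.splitlines() if "=" in line)
    print(f"{svc:>8}: state={props.get('ActiveState', '?')} "
          f"restarts={props.get('NRestarts', 'n/a')}")
```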

In the end, good reliability is not that hard to achieve, but it does take dedication, expertise, and resources to choose and configure the best you can, at all levels, all the time. See you in a thousand server years.
