System Reliability - One Crash per Thousand Server Years
Source: Internet. Editor: 程序博客网. Time: 2024/05/22 17:50
Servers crash, except when they don’t. Our servers don’t crash, at least not often. Our current rate is about one crash per thousand server years. That means for every thousand servers we run, we get one crash per year, including on all our Xen-based private clouds and their VMs. This is darn low by anyone’s standards.
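To make the arithmetic concrete, here is a minimal sketch of how a rate like this can be computed and then scaled to a fleet of a different size. The function names and the 250-server fleet are illustrative assumptions, not figures from our monitoring.

```python
def crashes_per_server_year(crashes: int, servers: int, years: float) -> float:
    """Observed crash rate: crashes divided by accumulated server-years."""
    server_years = servers * years
    if server_years <= 0:
        raise ValueError("need a positive number of server-years")
    return crashes / server_years


def expected_crashes(rate: float, servers: int, years: float = 1.0) -> float:
    """Expected number of crashes for a fleet at a given per-server-year rate."""
    return rate * servers * years


# One crash across a thousand servers in one year.
rate = crashes_per_server_year(crashes=1, servers=1000, years=1.0)

# Hypothetical smaller fleet: 250 servers over one year at the same rate.
small_fleet = expected_crashes(rate, servers=250)
```

At this rate, a hypothetical 250-server fleet would expect roughly one crash every four years.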
Why is this so good?
Well, it’s due to a number of factors, so let me go through them.
First, 95% of our systems run CentOS, which is of course the free community rebuild of Red Hat Enterprise Linux (RHEL). The entire purpose of RHEL is stability, so it has strict rules about what may change over the roughly 10-year lifetime of a major version (currently RHEL 7). That means a new version is essentially frozen in time when it debuts; for RHEL 7, that time was early 2014. Beyond major bug and security fixes, very little will be added for the rest of that lifetime.
Now, a lot of people complain that this is too old and that, for example, running RHEL 6, which is also still current, locks you into something from 2010. This is partly true, but for almost everything we do it is irrelevant. You don’t really need the new features in most cases (Docker being the exception), and we always install the latest versions of the services we run anyway. With an installation like this, you are really only locked into a stable kernel version and related components such as drivers, LVM/DM, and iptables, plus some bundled core packages such as openssl and bash.
People often want us to use the latest version of Ubuntu/Debian instead of CentOS, which we generally resist (see our other blog posts on this). In most cases they are less stable and cause more problems for us (not the least of which is the lack of various deb packages). But the real issue is reliability and stability: we’ve had far, far more problems with Ubuntu than with RHEL/CentOS. It is as simple as that.
We also use the best hardware we can, which helps a lot. Our standard is Dell rack servers, especially R420s, which work well for us with few problems. This mainstream server is well supported, well tested, and well understood, all of which contributes to overall stability. The same goes for clouds: we choose good ones where crashes are rare.
On top of the hardware, we install only the basics, without any unneeded services and junk. Doing so reduces the attack surface and also improves reliability, since there are fewer moving parts, libraries, services, and other things floating around.
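One way to keep an install this lean is to compare what is enabled against a short allowlist. The sketch below is only an illustration of that idea, not our actual tooling: the allowlist contents and the sample listing are assumptions, and the parser expects text in the two-column "unit state" shape that `systemctl list-unit-files --type=service --no-legend` prints on systemd hosts.

```python
# Hypothetical allowlist; a real one depends on the server's role.
ALLOWED = {"sshd.service", "crond.service", "rsyslog.service"}

# Made-up sample in the shape of `systemctl list-unit-files` output.
SAMPLE_LISTING = """\
crond.service      enabled
cups.service       enabled
postfix.service    disabled
sshd.service       enabled
"""


def parse_enabled_units(listing: str) -> set:
    """Return the names of service units whose state column is 'enabled'."""
    units = set()
    for line in listing.splitlines():
        fields = line.split()
        # Each useful line starts with "<name>.service <state>".
        if len(fields) >= 2 and fields[0].endswith(".service") and fields[1] == "enabled":
            units.add(fields[0])
    return units


def unexpected_services(listing: str, allowed=ALLOWED) -> list:
    """Enabled services that are not on the allowlist, sorted by name."""
    return sorted(parse_enabled_units(listing) - set(allowed))


extras = unexpected_services(SAMPLE_LISTING)
```

With the sample above, `extras` contains only `cups.service`, which is the kind of leftover we would disable or remove.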
In and on top of the OS, we also apply world-class configurations for the kernel, firewalls, drivers, and more, plus all the services such as MySQL and HAProxy. We also closely monitor all this to ensure the OOM beast stays in its cage, swap is minimal (but always available), NUMA is interleaved, and more, all in the name of stability.
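As a rough illustration of the "swap is minimal (but always available)" check, here is a minimal sketch that reads the `SwapTotal` and `SwapFree` fields from `/proc/meminfo`-style text. The 10% threshold, the status labels, and the sample numbers are assumptions made for this example, not our production settings.

```python
# Made-up sample in the format of /proc/meminfo (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       16330000 kB
SwapTotal:       2097148 kB
SwapFree:        2090000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_status(meminfo: dict, max_used_fraction: float = 0.10) -> str:
    """Classify swap as 'no-swap', 'ok', or 'swap-heavy' (hypothetical labels)."""
    total = meminfo.get("SwapTotal", 0)
    free = meminfo.get("SwapFree", 0)
    if total == 0:
        return "no-swap"  # swap should always be available
    used_fraction = (total - free) / total
    return "ok" if used_fraction <= max_used_fraction else "swap-heavy"


status = swap_status(parse_meminfo(SAMPLE_MEMINFO))
```

On a live Linux host the same functions can be fed the contents of `/proc/meminfo` instead of the sample text.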
Also, I should mention that our installed services rarely crash. Core services like Nginx, Apache, and Redis almost never do. We do see some issues in older PHP-FPM versions, while MySQL will crash maybe once per hundred server years. This reliability is due to all of the above, plus using vendor-supplied RPMs (almost never built from source), the latest versions, and best-practice configurations.
In the end, good reliability is not that hard to achieve, but it does take dedication, expertise, and resources to choose and configure the best you can, at all levels, all the time. See you in a thousand server years.