Fault tolerance in distributed systems


1. What is fault tolerance?

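First, it helps to be clear about what fault tolerance means. When an exception or error occurs, the current request should fail gracefully, and all other, healthy requests should continue to be served normally. A common example of a system that is not fault tolerant is a scheduled task that every so often exits for no obvious reason and stops running. On investigation, it may turn out that one run threw a NullPointerException that nobody caught, so the thread exited. This is a typical lack of fault tolerance: an exception in one request leaves the system unable to serve the others. That is my understanding of fault tolerance.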
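
The scheduled-task failure can be sketched in a few lines. This is a minimal illustration in Python rather than the Java setting the NullPointerException suggests, and the names `run_fragile`, `run_tolerant` and `name_length` are hypothetical. A plain loop stands in for the scheduler thread, and `len(None)` raising `TypeError` stands in for the uncaught NullPointerException.

```python
import logging

logger = logging.getLogger("scheduler")


def name_length(name):
    """Hypothetical task body: raises TypeError when name is None."""
    return len(name)


def run_fragile(task, inputs):
    """Not fault tolerant: the first uncaught exception ends the whole loop."""
    results = []
    for item in inputs:
        results.append(task(item))  # an exception here skips all later items
    return results


def run_tolerant(task, inputs):
    """Fault tolerant: a failing run is logged and isolated from the others."""
    results = []
    for item in inputs:
        try:
            results.append(("ok", task(item)))
        except Exception:
            # Record the full traceback, then carry on with the next run.
            logger.exception("task failed for input %r", item)
            results.append(("failed", item))
    return results
```

With the inputs `["a", None, "abc"]`, `run_fragile` never reaches `"abc"`, while `run_tolerant` fails only the `None` run and still serves the other two.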

Here is a passage from Programming Erlang:

Erlang was originally designed for programming fault-tolerant systems, systems that in principle should never stop. This means that dealing with errors at runtime is crucially important. We take error handling very seriously in Erlang. When errors occur, we need to detect them, correct them, and continue.

2. Let it crash philosophy

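Why do Akka and Erlang need a fault tolerance paradigm different from that of a single-node application (not necessarily a single-threaded one)? An Akka actor's responsibility should be as narrow and pure as possible. For an actor instance, many errors are hard to recover from on its own: when a node is broken, it is broken, and in a distributed application we should consider switching to another node. From a code quality point of view, an actor should not carry that much recovery logic either; an actor whose code is mostly exception handling is inelegant. Hence the supervision tree mechanism. A single-node application has no supervisor.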

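In other words, in a distributed system a faulty node can be dealt with by switching to another node. A try-catch inside the node clearly does not help much there; the failure has to be handled by a node one level up. A single-node application has only try-catch to rely on.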
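
A rough sketch of that division of labour, in plain Python rather than Akka or Erlang: the worker contains no recovery code at all, and a parent replaces it with a fresh instance when it crashes. `Counter` and `Supervisor` are hypothetical names, not part of any library. In a single process the parent's `try/except` stands in for the crash notification that a real actor runtime would deliver to the supervisor.

```python
class Counter:
    """Hypothetical worker: no error handling of its own, it simply crashes."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        self.total += int(message)  # ValueError on bad input: let it crash
        return self.total


class Supervisor:
    """Owns one worker and swaps in a fresh instance when the worker crashes."""

    def __init__(self, worker_factory):
        self.worker_factory = worker_factory
        self.worker = worker_factory()
        self.restarts = 0

    def send(self, message):
        try:
            return self.worker.handle(message)
        except Exception:
            # The failed worker is discarded, not repaired in place.
            self.worker = self.worker_factory()
            self.restarts += 1
            return None
```

After a crash the replacement starts from a clean state, so `Supervisor(Counter)` answers `"2"` and `"3"` with 2 and 5, loses the total on `"oops"`, and answers the next `"4"` with 4.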

Programming Erlang makes the same argument:

Typical Erlang applications are composed of dozens to millions of concurrent processes. Having large numbers of processes changes how we think about error handling. In a sequential language with only one process, it is crucially important that this process does not crash. If we have large numbers of processes, it is not so important if a process crashes, provided some other process can detect the crash and take over whatever the crashed process was supposed to be doing.
To build really fault-tolerant systems, we need more than one computer; after all, the entire computer might crash. So, the idea of detecting failure and resuming the computation elsewhere has to be extended to networked computers.

3. Fault tolerance strategy closure

Restart, stop, ignore...
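
The three strategies named above can be sketched as a small decision function that a supervisor consults after each failure. This is plain Python, not the Akka or Erlang API. The mapping from exception type to strategy in `decide` is an arbitrary policy chosen for illustration, and `Directive`, `Averager` and `supervise` are hypothetical names.

```python
from enum import Enum


class Directive(Enum):
    RESTART = "restart"
    STOP = "stop"
    IGNORE = "ignore"


def decide(error):
    """Illustrative policy only: map a failure to one of the three strategies."""
    if isinstance(error, ValueError):
        return Directive.IGNORE   # bad message: drop it, keep the worker state
    if isinstance(error, ArithmeticError):
        return Directive.RESTART  # replace the worker with a fresh instance
    return Directive.STOP         # anything else: stop processing


class Averager:
    """Hypothetical worker that keeps a running total and count."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def handle(self, message):
        if message == "avg":
            return self.total / self.count  # ZeroDivisionError with no data
        self.total += int(message)  # ValueError or TypeError on bad input
        self.count += 1
        return self.total


def supervise(worker_factory, messages):
    """Feed messages to a worker and apply the decided strategy on failure."""
    worker = worker_factory()
    log = []
    for message in messages:
        try:
            log.append(("ok", worker.handle(message)))
        except Exception as error:
            directive = decide(error)
            log.append((directive.value, message))
            if directive is Directive.RESTART:
                worker = worker_factory()
            elif directive is Directive.STOP:
                break
            # IGNORE: keep the same worker and its state
    return log
```

Keeping the policy in `decide` and out of the worker means the worker code stays free of recovery logic, which is the point of the supervision approach described in section 2.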

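Programming Erlang describes two key principles for handling errors. I strongly agree with both, and they have to be upheld in practice. Not long ago, a customer of our API product also explicitly asked for fail fast.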

Programming Erlang: We need to consider two key principles when coding for errors. First, we should fail as soon as an error occurs, and we should fail noisily. Several programming languages adopt the principle of failing silently, trying to fix up the error and continuing; this results in code that is a nightmare to debug. In Erlang, when an error is detected internally by the system or is detected by program logic, the correct approach is to crash immediately and generate a meaningful error message. We crash immediately so as not to make matters worse. The error message should be written to a permanent error log and be sufficiently detailed so that we can figure out what went wrong later.
Second, fail politely means that only the programmer should see the detailed error messages produced when a program crashes. A user of the program should never see these messages. On the other hand, the user should be alerted to the fact that an error has occurred and be told what action they can take to remedy the error.
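
As a rough illustration of the two principles at an API boundary, here is a sketch in Python. The `withdraw` operation, `handle_request`, `make_error_log` and the reply format are all hypothetical, and an in-memory stream stands in for the permanent error log. The operation fails immediately with a detailed message, the boundary writes the full traceback to the log, and the caller sees only a polite message with a suggested action.

```python
import io
import logging


def make_error_log(stream):
    """Build a logger writing to `stream`, a stand-in for a permanent log file."""
    logger = logging.getLogger("app.errors")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.ERROR)
    logger.addHandler(logging.StreamHandler(stream))
    return logger


def withdraw(balance, amount):
    # Fail fast: reject bad input at once instead of patching it up.
    if amount <= 0:
        raise ValueError(f"withdraw amount must be positive, got {amount!r}")
    if amount > balance:
        raise ValueError(f"insufficient funds: balance={balance}, amount={amount}")
    return balance - amount


def handle_request(balance, amount, logger):
    try:
        return {"ok": True, "balance": withdraw(balance, amount)}
    except Exception:
        # Fail noisily: the detailed message and traceback go to the error log.
        logger.exception("withdraw failed: balance=%r amount=%r", balance, amount)
        # Fail politely: the user learns that it failed and what to do next.
        return {
            "ok": False,
            "message": "The request could not be completed. "
                       "Check the amount and try again.",
        }
```

A call such as `handle_request(100, 500, make_error_log(io.StringIO()))` returns the polite reply, while the detail `insufficient funds: balance=100, amount=500` and the traceback appear only in the log stream.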

Reference:

[1] Programming Erlang, 2nd Edition.

0 0