[ZooKeeper] Error Handling Mechanisms

Source: Internet | Editor: 程序博客网 | Time: 2024/05/01 14:03

Why understand ZooKeeper's error handling?

Life would be so much easier if failures never happened. Of course, without failures, much of the need for ZooKeeper would also go away. To effectively use ZooKeeper it is important to understand the kinds of failures that happen and how to handle them.

In short: using ZooKeeper effectively requires knowing how failures arise and how to respond to each kind.

ZooKeeper exposes two kinds of failures (recoverable and unrecoverable)

ZooKeeper exposes two classes of failures: recoverable and unrecoverable.

Recoverable failures are transient and should be considered relatively normal—things happen. Brief network hiccups and server failures can cause these kinds of failures. Developers should write their code so that their applications keep running in spite of these failures.
Unrecoverable failures are much more problematic. These kinds of failures cause the ZooKeeper handle to become inoperable. The easiest and most common way to deal with this kind of failure is to exit the application. Examples of causes of this class of failure are session timeouts, network outages for longer than the session timeout, and authentication failures.

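To summarize, ZooKeeper failures fall into two classes: recoverable and unrecoverable.
Recoverable failures are transient and fairly normal. Brief network glitches and server failures can cause them, and application code should be written so that it keeps running through them.

Unrecoverable failures are harder to deal with. They leave the ZooKeeper handle inoperable, and the simplest and most common response is to exit the application. Typical causes are session timeouts, network outages that last longer than the session timeout, and authentication failures.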
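
The split between the two classes can be sketched in code. This is a minimal illustration, not a real client API: `RecoverableError`, `UnrecoverableError`, `call_with_retry`, and the two sample operations are hypothetical names standing in for whatever connection-loss and session-expired errors a client library reports. Recoverable failures are retried; unrecoverable ones are allowed to propagate so the application can exit.

```python
import time


class RecoverableError(Exception):
    """Hypothetical stand-in for a transient failure such as a lost connection."""


class UnrecoverableError(Exception):
    """Hypothetical stand-in for a fatal failure such as an expired session."""


def call_with_retry(operation, max_attempts=5, delay_seconds=0.0):
    """Run operation(), retrying only on RecoverableError.

    UnrecoverableError is deliberately not caught: the caller is expected
    to shut down (or start over with a new session) when it sees one.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RecoverableError:
            if attempt == max_attempts:
                raise  # still failing after the last attempt; give up
            time.sleep(delay_seconds)


def make_flaky_operation(failures_before_success):
    """Build an operation that fails transiently a fixed number of times."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RecoverableError("connection lost")
        return "ok after %d calls" % state["calls"]

    return operation


def expired_operation():
    """An operation that always hits an unrecoverable failure."""
    raise UnrecoverableError("session expired")
```

A blind retry is only safe when the operation is idempotent or the caller first checks whether the earlier attempt took effect, because a lost connection does not tell the client whether its request reached the server.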

Recoverable failures

A typical cause of Disconnected events and ConnectionLossExceptions is a ZooKeeper server failure.

Figure 5-5 illustrates the corner case that causes us to miss the creation event of a watched znode. In this example, the client is watching for the creation of /event. However, just as the /event is created by another client, the watching client loses its connection to ZooKeeper. During this time the other client deletes /event, so when the watching client reconnects to ZooKeeper and reregisters its watch, the ZooKeeper server no longer has the /event znode. Thus, when it processes the registered watches and sees the watch for /event, and sees that there is no node called /event, it simply reregisters the watch, causing the client to miss the creation event for /event. Because of this corner case, you should try to avoid watching for the creation event of a znode. If you do watch for a creation event it should be for a long-lived znode; otherwise, this corner case can bite you.
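
The corner case can be reproduced with a toy model. The sketch below is an assumption-laden simulation, not ZooKeeper code: `TinyServer`, `reregister_exists_watch`, and `events_seen_after_reconnect` are hypothetical names, and the server is reduced to a set of paths. It only models the behavior described above: on reconnect, a watch on a missing znode is set again without firing anything.

```python
class TinyServer:
    """Toy in-memory model of a server's znode set (not the real ZooKeeper API)."""

    def __init__(self):
        self.znodes = set()

    def create(self, path):
        self.znodes.add(path)

    def delete(self, path):
        self.znodes.discard(path)

    def reregister_exists_watch(self, path):
        """Model a reconnecting client re-sending its creation watch.

        If the znode exists now, the client is told that it was created.
        If it does not exist, the watch is set again and nothing fires.
        """
        return "NodeCreated" if path in self.znodes else None


def events_seen_after_reconnect(delete_before_reconnect):
    """Return the events the watching client receives once it reconnects."""
    server = TinyServer()
    events = []
    # The watching client is disconnected here, so it sees nothing live.
    server.create("/event")          # another client creates the znode
    if delete_before_reconnect:
        server.delete("/event")      # ...and deletes it before the reconnect
    # The watching client reconnects and re-registers its watch.
    event = server.reregister_exists_watch("/event")
    if event is not None:
        events.append(event)
    return events
```

With a short-lived znode, `events_seen_after_reconnect(True)` yields no events at all; with a long-lived one, `events_seen_after_reconnect(False)` still reports the creation, which is why the advice is to watch for creation only on long-lived znodes.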

Unrecoverable failures

1. At t1, c1 becomes unresponsive due to overload and stops communicating with ZooKeeper. It has queued up changes to the external resource but has not yet received the CPU cycles to send them.
2. At t2, ZooKeeper declares c1’s session with ZooKeeper dead. At this time it also deletes all ephemeral nodes associated with c1’s sessions, including the ephemeral node that it created to become the master.
3. At t3, c2 becomes the master.
4. At t4, c2 changes the state of the external resource.
5. At t5, c1’s overload subsides and it sends its queued changes to the external resource.
6. At t6, c1 is able to reconnect to ZooKeeper, finds out that its session has expired, and relinquishes mastership. Unfortunately, the damage has been done: at time t5, changes were made to the external resource, resulting in corruption.
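
The six steps above can be replayed against a toy resource to see where things go wrong. This is a sketch under stated assumptions: `UnfencedResource` and `replay_timeline` are hypothetical names, the external resource is modeled as a single value, and nothing on the resource side checks who the current master is.

```python
class UnfencedResource:
    """Toy external resource that accepts every write it receives."""

    def __init__(self):
        self.value = None
        self.history = []

    def write(self, client, value):
        self.value = value
        self.history.append((client, value))


def replay_timeline():
    """Replay t1..t6 and return the final master and the resource."""
    resource = UnfencedResource()
    master = "c1"
    c1_queue = ["c1-update"]             # t1: c1 stalls with a queued change
    master = None                        # t2: c1's session expires, its ephemeral node is deleted
    master = "c2"                        # t3: c2 becomes the master
    resource.write("c2", "c2-update")    # t4: c2 changes the resource
    for change in c1_queue:              # t5: c1 wakes up and flushes its queue
        resource.write("c1", change)
    # t6: c1 learns that its session expired, but its write already landed.
    return master, resource
```

After the replay, the master is `c2` yet the resource holds `c1-update`: the stale write from the former master overwrote the current master's change.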

Figure 5-7 shows how this technique solves the scenario of Figure 5-6. When c1 becomes the leader at time t1, the creation zxid of the /leader znode is 3 (in reality, the zxid would be a much larger number). It supplies the creation zxid as the fencing token to connect with the database. Later, when c1 becomes unresponsive due to overload, ZooKeeper declares c1 as failed and c2 becomes the new leader at time t2. c2 uses 4 as its fencing token because the /leader znode it created has a creation zxid of 4. At time t3, c2 starts making requests to the database using its fencing token. Now when c1’s request arrives at the database at time t4, it is rejected because its fencing token (3) is lower than the highest seen fencing token (4), thus avoiding corruption.
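
The fencing check on the resource side can be sketched in a few lines. This is an illustrative model only: `FencedResource` and `replay_fenced_timeline` are hypothetical names, and the tokens 3 and 4 are the example values from the text. In the Java client the creation zxid of a znode is available from `Stat.getCzxid()`; how the token reaches the database is left out here.

```python
class FencedResource:
    """Toy database that remembers the highest fencing token it has seen."""

    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        """Apply the write only if the token is not older than the newest seen."""
        if token < self.highest_token:
            return False                 # stale leader: reject the request
        self.highest_token = token
        self.value = value
        return True


def replay_fenced_timeline():
    """Replay the Figure 5-7 timeline with the example tokens 3 and 4."""
    db = FencedResource()
    c1_token = 3    # t1: creation zxid of /leader when c1 became leader
    c2_token = 4    # t2: creation zxid of /leader when c2 became leader
    accepted_c2 = db.write(c2_token, "c2-update")    # t3: accepted
    accepted_c1 = db.write(c1_token, "c1-update")    # t4: rejected, 3 < 4
    return accepted_c2, accepted_c1, db.value
```

The scheme works because zxids only increase, so a later leader always holds a larger token than any earlier one, and the resource needs to store nothing more than the highest token seen so far.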

0 0