Akka (Part 4): Supervision and Monitoring

Source: Internet | Published by: 和大学老师啪啪知乎 | Editor: 程序博客网 | Time: 2024/05/16 12:20

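What is Supervision

An earlier article on the Actor Model noted that a parent actor supervises its children: because the parent delegates tasks to its children, it must respond when one of them fails. When an actor detects a failure (for example, an exception), it suspends itself and all of its subordinates (that is, they stop processing messages) and sends a message to its supervisor (its parent) signaling the failure. The supervisor then decides how to handle it.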

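Supervision Strategies

When a subordinate fails, the supervisor has a choice of the following four options:

1) Resume the subordinate, keeping its accumulated internal state

2) Restart the subordinate, clearing out its accumulated internal state

3) Stop the subordinate permanently

4) Escalate the failure to its own supervisor, thereby failing itself (somewhat like rethrowing an exception in Java)

Every actor in Akka is part of the supervision hierarchy, so an actor can act both as a supervisor and as a subordinate. The state of a parent directly affects its children, so the first three options have the following consequences:

1) Resuming an actor also resumes all of its subordinates

2) Restarting an actor also restarts all of its subordinates

3) Stopping an actor also stops all of its subordinates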

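The default behavior of an actor's preRestart hook is to terminate all of its children before the actor is restarted, which makes this a recursive process. The hook can be overridden, so an override should be written with care, because it changes what happens to the children.

The preRestart method in UntypedActor: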

    /**
     * User overridable callback: '''By default it disposes of all children and then calls `postStop()`.'''
     * <p/>
     * Is called on a crashed Actor right BEFORE it is restarted to allow clean
     * up of resources before Actor is terminated.
     */
    @throws(classOf[Exception])
    override def preRestart(reason: Throwable, message: Option[Any]): Unit = super.preRestart(reason, message)

This simply delegates to the parent class's preRestart method, that is, the one defined in Actor.

The preRestart method in Actor:

    /**
     * User overridable callback: '''By default it disposes of all children and then calls `postStop()`.'''
     * @param reason the Throwable that caused the restart to happen
     * @param message optionally the current message the actor processed when failing, if applicable
     * <p/>
     * Is called on a crashed Actor right BEFORE it is restarted to allow clean
     * up of resources before Actor is terminated.
     */
    @throws(classOf[Exception]) // when changing this you MUST also change UntypedActorDocTest
    //#lifecycle-hooks
    def preRestart(reason: Throwable, message: Option[Any]): Unit = {
      context.children foreach { child ⇒
        context.unwatch(child)
        context.stop(child)
      }
      postStop()
    }

As the code shows, every child is unwatched and then stopped, after which postStop is called.
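The recursion can be illustrated with a small model in plain Java. This is only a sketch and not Akka code: `ToyActor`, `spawn`, `stop`, and `demo` are hypothetical names introduced here, and the model stops children synchronously, whereas real Akka does it asynchronously.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Akka): a node stops its whole subtree, children first.
final class ToyActor {
    final String name;
    final List<ToyActor> children = new ArrayList<>();
    boolean stopped = false;

    ToyActor(String name) {
        this.name = name;
    }

    ToyActor spawn(String childName) {
        ToyActor child = new ToyActor(childName);
        children.add(child);
        return child;
    }

    // Stops every child recursively, then marks this node as stopped.
    void stop(List<String> log) {
        for (ToyActor child : children) {
            child.stop(log);
        }
        children.clear();
        stopped = true;
        log.add("stopped " + name);
    }

    // Same shape as the default preRestart: stop each child, then run postStop.
    void preRestart(List<String> log) {
        for (ToyActor child : new ArrayList<>(children)) {
            child.stop(log);
        }
        children.clear();
        log.add("postStop " + name);
    }

    // Builds parent -> child -> grandchild and triggers preRestart on the parent.
    static List<String> demo() {
        ToyActor parent = new ToyActor("parent");
        ToyActor child = parent.spawn("child");
        child.spawn("grandchild");
        List<String> log = new ArrayList<>();
        parent.preRestart(log);
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this model the grandchild is stopped before the child, and the parent's postStop runs last, which is the ordering the recursive description implies.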

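An actor decides which directive to apply to a particular exception through a configurable function: the Decider of its supervisor strategy.

Here is the default supervisor strategy: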

    /**
     * When supervisorStrategy is not specified for an actor this
     * [[Decider]] is used by default in the supervisor strategy.
     * The child will be stopped when [[akka.actor.ActorInitializationException]],
     * [[akka.actor.ActorKilledException]], or [[akka.actor.DeathPactException]] is
     * thrown. It will be restarted for other `Exception` types.
     * The error is escalated if it's a `Throwable`, i.e. `Error`.
     */
    final val defaultDecider: Decider = {
      case _: ActorInitializationException ⇒ Stop
      case _: ActorKilledException         ⇒ Stop
      case _: DeathPactException           ⇒ Stop
      case _: Exception                    ⇒ Restart
    }

    /**
     * When supervisorStrategy is not specified for an actor this
     * is used by default. OneForOneStrategy with decider defined in
     * [[#defaultDecider]].
     */
    final val defaultStrategy: SupervisorStrategy = {
      OneForOneStrategy()(defaultDecider)
    }

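The default strategy is OneForOneStrategy, which will be covered in detail later. By default, defaultDecider decides which directive to apply to a given exception. The directives are the four options listed at the beginning of this article.

ActorInitializationException, ActorKilledException, and DeathPactException lead to Stop, and any other Exception leads to Restart. A Throwable that is not an Exception (that is, an Error) matches no case and is escalated, as the source comment states.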
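The decision table of the default decider can be restated in plain Java. The following is a minimal sketch and not the Akka API: `DefaultDeciderSketch`, `Directive`, and `decide` are hypothetical names, and the three nested exception classes are stand-ins for the Akka types of the same names so that the example compiles without Akka on the classpath.

```java
// Toy restatement of the default decider (not the Akka API).
final class DefaultDeciderSketch {

    // The four options a supervisor can choose from.
    enum Directive { RESUME, RESTART, STOP, ESCALATE }

    // Stand-ins for the Akka exception types with the same simple names.
    static class ActorInitializationException extends RuntimeException {}
    static class ActorKilledException extends RuntimeException {}
    static class DeathPactException extends RuntimeException {}

    // Mirrors the case order of defaultDecider; order matters because the
    // three special types are themselves Exceptions.
    static Directive decide(Throwable cause) {
        if (cause instanceof ActorInitializationException) return Directive.STOP;
        if (cause instanceof ActorKilledException) return Directive.STOP;
        if (cause instanceof DeathPactException) return Directive.STOP;
        if (cause instanceof Exception) return Directive.RESTART;
        // Any other Throwable (an Error) matches no case and is escalated.
        return Directive.ESCALATE;
    }

    public static void main(String[] args) {
        System.out.println(decide(new ActorKilledException()));
        System.out.println(decide(new IllegalStateException("bad state")));
        System.out.println(decide(new OutOfMemoryError()));
    }
}
```

Note that RESUME is never returned: the default decider does not resume a failed child, so a supervisor that wants that behavior has to supply its own decider.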

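Top-Level Supervisors

An actor system creates at least three actors at startup: the root guardian and the two guardians described below.

/user: The Guardian Actor

The actor that is probably interacted with most is the parent of all user-created actors, the guardian named /user. Every actor created with system.actorOf() is a child of this guardian. This means that when this guardian terminates, all user-level actors in the system are terminated as well. It also means that this guardian's supervisor strategy determines how the top-level user actors are supervised. When /user escalates a failure to the root guardian, the root guardian responds by terminating /user, which, by the rules above, in effect stops the whole system.

/system: The System Guardian

This guardian supervises all system-created actors, such as logging listeners, so that logging stays active while the normal actors shut down. /system watches /user, and once it receives the Terminated message indicating that /user (and therefore all of its subordinates) has terminated, it starts its own shutdown. Its strategy is to restart a top-level system actor on every Exception except ActorInitializationException and ActorKilledException, which terminate the actor in question.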

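What is Restart

When an actor fails while processing a message, the cause falls into one of three categories:
1) A systematic (programming) error triggered by the specific message received
2) A failure of some external resource used while processing the message
3) Corrupt internal state of the actor

Unless the failure is specifically recognizable, the third cause cannot be ruled out, so the actor's internal state needs to be cleared. If the supervisor decides that its other children and the supervisor itself are not affected by the corruption, the best approach is to restart the failed child. This is done by creating a new instance of the actor and replacing the failed instance inside the child's ActorRef with it. The new actor then resumes processing its mailbox, which means the restart is not visible outside the actor itself, except that the message being processed when the failure occurred is not processed again.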

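A restart proceeds in the following steps:
1) Suspend the actor (it processes no further messages until it is resumed) and recursively suspend all of its children.
2) Call the old instance's preRestart hook, which by default sends termination requests to all children and then calls postStop, as described earlier.
3) Wait for all children that were asked to terminate during preRestart to actually terminate. Like all actor operations, this is non-blocking: the termination notice from the last stopped child triggers the next step.
4) Create the new actor instance by invoking the originally provided factory again.
5) Call postRestart on the new instance, which by default calls preStart.
6) Send a restart request to every child that was not stopped in step 3. These children go through the same process recursively, starting from step 2.
7) Resume the actor.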
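The hook order in the sequence above can be sketched with a small synchronous model in plain Java. It is not Akka code: `RestartSketch`, `Worker`, `Cell`, and `demo` are hypothetical names, the worker has no children (so the steps that wait for or restart children have nothing to do), and suspending and resuming are implicit because the model is single-threaded.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model (not Akka): the cell keeps the mailbox and swaps the instance on restart.
final class RestartSketch {

    // Minimal actor with the lifecycle hooks named in the restart steps.
    static class Worker {
        final int generation;
        final List<String> log;

        Worker(int generation, List<String> log) {
            this.generation = generation;
            this.log = log;
            log.add("create #" + generation);
        }

        void preStart() { log.add("preStart #" + generation); }
        void postStop() { log.add("postStop #" + generation); }
        void preRestart() { postStop(); }   // default: stop children (none here), then postStop
        void postRestart() { preStart(); }  // default: delegate to preStart

        void receive(String msg) {
            if (msg.equals("boom")) {
                throw new IllegalStateException(msg);
            }
            log.add("handled " + msg + " by #" + generation);
        }
    }

    // Plays the role of the ActorRef: stable identity, mailbox, current instance.
    static class Cell {
        final Queue<String> mailbox = new ArrayDeque<>();
        final List<String> log = new ArrayList<>();
        int created = 0;
        Worker instance;

        Cell() {
            instance = newInstance();
            instance.preStart();
        }

        Worker newInstance() { return new Worker(++created, log); }

        void tell(String msg) { mailbox.add(msg); }

        // Processes the mailbox; a failure restarts the worker and drops the failing message.
        void run() {
            while (!mailbox.isEmpty()) {
                String msg = mailbox.poll();
                try {
                    instance.receive(msg);
                } catch (RuntimeException e) {
                    restart();
                }
            }
        }

        void restart() {
            instance.preRestart();     // hook on the old instance
            instance = newInstance();  // fresh instance from the factory
            instance.postRestart();    // hook on the new instance
        }
    }

    static List<String> demo() {
        Cell cell = new Cell();
        cell.tell("a");
        cell.tell("boom");
        cell.tell("b");
        cell.run();
        return cell.log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The message "b" stays in the mailbox across the restart and is handled by the second instance, while "boom" is not retried. A sender holding the cell cannot tell that the instance behind it was replaced.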

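What is Monitoring

Note: Lifecycle Monitoring in Akka is usually referred to as DeathWatch.

In contrast to the special relationship between parent and child, any actor may monitor any other actor. Because a restart is not visible to anyone except the supervisor, the only state change available for monitoring is the transition from alive to dead. Monitoring therefore ties one actor to one or more other actors so that it can react to their termination, whereas supervision reacts to failure.

How is monitoring implemented? The monitoring actor receives a Terminated message when the monitored actor has terminated. If the monitoring actor does not handle that message, the default behavior is to throw a DeathPactException.

Monitoring starts when an actor calls ActorContext.watch(targetActorRef) and ends when it calls ActorContext.unwatch(targetActorRef).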
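The watch/unwatch bookkeeping can be pictured with a simplified model in plain Java. This is a sketch under stated assumptions, not Akka's implementation: `DeathWatchSketch`, `Node`, `received`, and `demo` are hypothetical names, notifications are delivered synchronously, and only watches registered before the target stops are modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model (not Akka): each node remembers who watches it and notifies them on stop.
final class DeathWatchSketch {

    static class Node {
        final String name;
        final Set<Node> watchers = new LinkedHashSet<>();
        final List<String> received = new ArrayList<>();
        boolean alive = true;

        Node(String name) {
            this.name = name;
        }

        // Registers this node as a watcher of the target.
        void watch(Node target) { target.watchers.add(this); }

        // Removes the registration again; no notification will be delivered.
        void unwatch(Node target) { target.watchers.remove(this); }

        // Marks the node dead and delivers one Terminated notice per watcher.
        void stop() {
            if (!alive) {
                return;
            }
            alive = false;
            for (Node watcher : watchers) {
                watcher.received.add("Terminated(" + name + ")");
            }
            watchers.clear();
        }
    }

    static List<String> demo() {
        Node actor1 = new Node("actor1");
        Node actor2 = new Node("actor2");
        Node actor3 = new Node("actor3");
        actor1.watch(actor2);
        actor1.watch(actor3);
        actor1.unwatch(actor3);  // actor1 is no longer interested in actor3
        actor2.stop();
        actor3.stop();
        return actor1.received;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Only the termination of actor2 reaches actor1, because the watch on actor3 was removed before actor3 stopped.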

    import java.util.concurrent.atomic.AtomicInteger;

    import akka.actor.Props;
    import akka.actor.Terminated;
    import akka.actor.UntypedActor;
    import akka.japi.Creator;

    // MSG is not shown in the original post; it is presumably an enum with at least HI and OK.
    public class Actor1 extends UntypedActor {

        private static final class CreatorImplementation implements Creator<Actor1> {
            private static final long serialVersionUID = 1L;
            private AtomicInteger index = new AtomicInteger();

            public Actor1 create() throws Exception {
                return new Actor1("Actor1", index.getAndAdd(1));
            }
        }

        String name;
        int index;

        public Actor1(String name, int index) {
            this.name = name;
            this.index = index;
        }

        @Override
        public void onReceive(Object message) throws Exception {
            System.out.println(name + "_" + index + " receive message");
            System.out.println("message: " + message);
            if (message instanceof MSG) {
                if (message.equals(MSG.OK)) {
                    System.out.println("i receive ok");
                    // Start monitoring the sender (Actor2 in this example).
                    getContext().watch(getSender());
                } else {
                    unhandled(message);
                }
            } else if (message instanceof Terminated) {
                // A monitored actor has terminated.
                System.out.println(getSender() + " terminate");
            } else {
                unhandled(message);
            }
        }

        public static Props props() {
            return Props.create(new CreatorImplementation());
        }
    }

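When Actor1 receives MSG.OK, it calls getContext().watch(getSender()) to monitor the sender, which is Actor2 in this example. When it receives a Terminated message, it prints a corresponding line.

When Actor2 terminates, Actor1 receives a Terminated message.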

    import java.util.concurrent.atomic.AtomicInteger;

    import akka.actor.Props;
    import akka.actor.UntypedActor;
    import akka.japi.Creator;

    public class Actor2 extends UntypedActor {

        private static final class CreatorImplementation implements Creator<Actor2> {
            private static final long serialVersionUID = 1L;
            private AtomicInteger index = new AtomicInteger();

            public Actor2 create() throws Exception {
                return new Actor2("Actor2", index.getAndAdd(1));
            }
        }

        String name;
        int index;

        public Actor2(String name, int index) {
            this.name = name;
            this.index = index;
        }

        @Override
        public void onReceive(Object message) throws Exception {
            System.out.println(name + "_" + index + " receive message");
            if (message instanceof MSG) {
                if (message.equals(MSG.HI)) {
                    System.out.println("i receive hi");
                    // Reply with OK so that the sender starts watching this actor.
                    getSender().tell(MSG.OK, getSelf());
                } else {
                    unhandled(message);
                }
            } else {
                unhandled(message);
            }
        }

        public static Props props() {
            return Props.create(new CreatorImplementation());
        }
    }

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;

    public class CreateTest {

        public void communication() {
            // Create the actors using Props
            final ActorSystem system = ActorSystem.create("MySystem");
            // Create actor1
            final ActorRef actor1 = system.actorOf(Actor1.props(), "actor1");
            // Create actor2
            final ActorRef actor2 = system.actorOf(Actor2.props(), "actor2");
            // Send a message to actor2 with actor1 as the sender
            actor2.tell(MSG.HI, actor1);
            // Give actor1 and actor2 enough time to process the messages
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            // Stop actor2, then shut down the system
            system.stop(actor2);
            system.shutdown();
        }

        public static void main(String[] args) {
            new CreateTest().communication();
        }
    }

Running the program produces the following output:

    Actor2_0 receive message
    i receive hi
    Actor1_0 receive message
    message: OK
    i receive ok
    Actor1_0 receive message
    message: Terminated(Actor[akka://MySystem/user/actor2#-1091096557])
    Actor[akka://MySystem/user/actor2#-1091096557] terminate
    [INFO] [07/20/2014 12:14:41.522] [MySystem-akka.actor.default-dispatcher-2] [akka://MySystem/user/actor2] Message [akka.dispatch.sysmsg.Terminate] from Actor[akka://MySystem/user/actor2#-1091096557] to Actor[akka://MySystem/user/actor2#-1091096557] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

As the output shows, Actor1 received two messages, and the second one is the Terminated message for Actor2.

0 0