An Introduction to RescueParty in Android O

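I. Overview
Android can end up in states it cannot recover from on its own, for example failing to boot, or a persistent system process crashing in an endless loop. In these situations the phone is usually unusable, and non-technical users typically have no idea how to fix it and can only send the device in for service. Android O adds a rescue mechanism to deal with exactly these cases: RescueParty.
Roughly, RescueParty works like this: when a persistent app under a given uid crashes repeatedly, RescueParty counts the crashes per uid, and once the count reaches the default threshold within a time window it escalates the rescue level. The rescue levels are: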
1.NONE
2.RESET_SETTINGS_UNTRUSTED_DEFAULTS
3.RESET_SETTINGS_UNTRUSTED_CHANGES
4.RESET_SETTINGS_TRUSTED_DEFAULTS
5.FACTORY_RESET
The final rescue level reboots the device into recovery mode.

So which scenarios trigger this mechanism?
1.a persistent app is stuck in a crash loop
2.we're stuck in a runtime restart loop.
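II. How RescueParty works
We start with the first scenario, "a persistent app is stuck in a crash loop". The app crash flow itself is not covered in detail here; a sequence diagram summarizes it:
[Figure: sequence diagram of the app crash flow]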

In O, the crashApplicationInner method in AppErrors.java gained a RescueParty hook. The relevant code is:
void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
        int callingPid, int callingUid) {
    ...
    // If a persistent app is stuck in a crash loop, the device isn't very
    // usable, so we want to consider sending out a rescue party.
    if (r != null && r.persistent) {
        RescueParty.notePersistentAppCrash(mContext, r.uid);
    }
    AppErrorResult result = new AppErrorResult();
    TaskRecord task;
    ...
}


This calls RescueParty.notePersistentAppCrash, passing in the Context and the uid of the process. Let's look inside that method:
/**
 * Take note of a persistent app crash. If we notice too many of these
 * events happening in rapid succession, we'll send out a rescue party.
 */
public static void notePersistentAppCrash(Context context, int uid) {
    if (isDisabled()) return;
    Threshold t = sApps.get(uid);
    if (t == null) {
        t = new AppThreshold(uid);
        sApps.put(uid, t);
    }
    if (t.incrementAndTest()) {
        t.reset();
        incrementRescueLevel(t.uid);
        executeRescueLevel(context);
    }
}
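The method first checks whether the RescueParty mechanism is disabled. It is disabled in the following cases:
1.The build is an eng build.
2.The build is a userdebug build and USB is currently connected.
3.The system property persist.sys.disable_rescue is true (check with getprop persist.sys.disable_rescue).
In all other cases the mechanism is enabled.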
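The three conditions above can be sketched as a small pure-Java function. This is only an illustration, not the AOSP implementation: the class name RescueDisabledSketch and the boolean parameters are hypothetical stand-ins for the build flags, the USB state and the system property that the real isDisabled() reads from the platform.

```java
// Hypothetical stand-in for RescueParty.isDisabled(). The real method reads
// build flags and system properties; here they are passed in as arguments.
class RescueDisabledSketch {
    static boolean isDisabled(boolean isEng, boolean isUserdebug,
                              boolean usbActive, boolean disableProp) {
        if (isEng) return true;                     // 1. eng builds
        if (isUserdebug && usbActive) return true;  // 2. userdebug with USB connected
        if (disableProp) return true;               // 3. persist.sys.disable_rescue is true
        return false;                               // everything else stays enabled
    }

    public static void main(String[] args) {
        // A userdebug device that is not plugged in still gets rescued.
        System.out.println(isDisabled(false, true, false, false));
    }
}
```

Note that the USB check only matters on userdebug builds in this sketch: a user build connected over USB is still covered by the mechanism.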
Back in notePersistentAppCrash: if RescueParty is not disabled, execution continues with:
Threshold t = sApps.get(uid);
if (t == null) {
    t = new AppThreshold(uid);
    sApps.put(uid, t);
}
if (t.incrementAndTest()) {
    t.reset();
    incrementRescueLevel(t.uid);
    executeRescueLevel(context);
}
First, the definition of sApps:
/** Threshold for app crash loops */
private static SparseArray<Threshold> sApps = new SparseArray<>();
Each uid maps to one Threshold object. The code looks up the Threshold for the uid; if there is none yet, it creates a new AppThreshold and stores it in sApps. It then calls incrementAndTest, which does the following:
/**
 * @return if this threshold has been triggered
 */
public boolean incrementAndTest() {
    final long now = SystemClock.elapsedRealtime();
    final long window = now - getStart();
    if (window > triggerWindow) {
        setCount(1);
        setStart(now);
        return false;
    } else {
        int count = getCount() + 1;
        setCount(count);
        EventLogTags.writeRescueNote(uid, count, window);
        Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
                + (window / 1000) + " sec");
        return (count >= triggerCount);
    }
}
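Next, the getStart/setStart/setCount/getCount methods. The implementation shown here is BootThreshold, the subclass used for the boot-loop case; the app-crash path above constructs an AppThreshold, which is a separate Threshold subclass whose definition is not reproduced in this article: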
private static class BootThreshold extends Threshold {
    public BootThreshold() {
        // We're interested in 5 events in any 300 second period; this
        // window is super relaxed because booting can take a long time if
        // forced to dexopt things.
        super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
    }

    @Override
    public int getCount() {
        return SystemProperties.getInt(PROP_RESCUE_BOOT_COUNT, 0);
    }

    @Override
    public void setCount(int count) {
        SystemProperties.set(PROP_RESCUE_BOOT_COUNT, Integer.toString(count));
    }

    @Override
    public long getStart() {
        return SystemProperties.getLong(PROP_RESCUE_BOOT_START, 0);
    }

    @Override
    public void setStart(long start) {
        SystemProperties.set(PROP_RESCUE_BOOT_START, Long.toString(start));
    }
}
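In other words, BootThreshold stores the window start time and the count in system properties (through SystemProperties) rather than in fields, which is what lets the values survive a restart of system_server.
As the code above shows, BootThreshold extends Threshold and calls its constructor: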
super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);

private abstract static class Threshold {
    ...
    public Threshold(int uid, int triggerCount, long triggerWindow) {
        this.uid = uid;
        this.triggerCount = triggerCount;
        this.triggerWindow = triggerWindow;
    }
    ...
}
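So for BootThreshold, triggerWindow is 300000 ms (300 seconds) and triggerCount is 5.
With that, the meaning of incrementAndTest is clear:
If more than triggerWindow (here 300000 ms) has elapsed since the start of the current window, the count is set to 1 and the start time is set to now, i.e. a new window begins. Otherwise the count is incremented and saved, and the method checks whether the count is greater than or equal to triggerCount (5). If it is, the method returns true, and the caller then executes: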
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
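To make the window logic of incrementAndTest concrete, here is a minimal self-contained sketch. It is a model under stated assumptions, not the AOSP class: WindowThresholdSketch is a hypothetical name, the count and start time live in plain fields, and the clock is injected as a LongSupplier in place of SystemClock.elapsedRealtime() so the behaviour can be exercised off-device.

```java
import java.util.function.LongSupplier;

// Hypothetical model of Threshold.incrementAndTest() and reset().
class WindowThresholdSketch {
    private final int triggerCount;
    private final long triggerWindowMs;
    private final LongSupplier clock; // stands in for SystemClock.elapsedRealtime()
    private int count = 0;
    private long start = 0;

    WindowThresholdSketch(int triggerCount, long triggerWindowMs, LongSupplier clock) {
        this.triggerCount = triggerCount;
        this.triggerWindowMs = triggerWindowMs;
        this.clock = clock;
    }

    /** Returns true once triggerCount events have landed inside one window. */
    boolean incrementAndTest() {
        final long now = clock.getAsLong();
        final long window = now - start;
        if (window > triggerWindowMs) {
            // The old window has expired: this event opens a new one.
            count = 1;
            start = now;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    /** Mirrors reset(): clears both the count and the window start. */
    void reset() {
        count = 0;
        start = 0;
    }
}
```

The window is anchored at the first event, not at the previous one: five events spread evenly over the window still trigger, while a long quiet gap starts the count again from 1.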
Let's look at the implementation of each of these three methods:
public void reset() {
    setCount(0);
    setStart(0);
}
This sets both the count and the start time back to 0.
/**
 * Escalate to the next rescue level. After incrementing the level you'll
 * probably want to call {@link #executeRescueLevel(Context)}.
 */
private static void incrementRescueLevel(int triggerUid) {
    final int level = MathUtils.constrain(
            SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
            LEVEL_NONE, LEVEL_FACTORY_RESET);
    SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));
    EventLogTags.writeRescueLevel(level, triggerUid);
    PackageManagerService.logCriticalInfo(Log.WARN, "Incremented rescue level to "
            + levelToString(level) + " triggered by UID " + triggerUid);
}
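This code reads the current level, adds 1, clamps the result to the range LEVEL_NONE to LEVEL_FACTORY_RESET, and writes it back to the system property.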
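The clamping step is what keeps the level from running past the last one. A minimal sketch, assuming the levels are numbered 0 to 4 in the order listed in the overview (the numeric values, the class name RescueLevelSketch and the nextLevel helper are assumptions made for illustration; constrain is written out by hand to mirror how MathUtils.constrain is used above):

```java
// Hypothetical model of the level arithmetic in incrementRescueLevel().
class RescueLevelSketch {
    // Assumed numbering, following the order in which the article lists the levels.
    static final int LEVEL_NONE = 0;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
    static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
    static final int LEVEL_FACTORY_RESET = 4;

    /** Clamps amount into the inclusive range [low, high]. */
    static int constrain(int amount, int low, int high) {
        return amount < low ? low : (amount > high ? high : amount);
    }

    /** Escalates by one step, never going beyond LEVEL_FACTORY_RESET. */
    static int nextLevel(int current) {
        return constrain(current + 1, LEVEL_NONE, LEVEL_FACTORY_RESET);
    }

    public static void main(String[] args) {
        int level = LEVEL_NONE;
        for (int i = 0; i < 6; i++) {
            level = nextLevel(level);
            System.out.println("level after trigger " + (i + 1) + ": " + level);
        }
    }
}
```

Once the level reaches LEVEL_FACTORY_RESET, further triggers leave it there, so every later trigger executes the factory-reset step again.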
private static void executeRescueLevel(Context context) {
    final int level = SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE);
    if (level == LEVEL_NONE) return;
    Slog.w(TAG, "Attempting rescue level " + levelToString(level));
    try {
        executeRescueLevelInternal(context, level);
        EventLogTags.writeRescueSuccess(level);
        PackageManagerService.logCriticalInfo(Log.DEBUG,
                "Finished rescue level " + levelToString(level));
    } catch (Throwable t) {
        final String msg = ExceptionUtils.getCompleteMessage(t);
        EventLogTags.writeRescueFailure(level, msg);
        PackageManagerService.logCriticalInfo(Log.ERROR,
                "Failed rescue level " + levelToString(level) + ": " + msg);
    }
}


This reads the current level and returns immediately if it is LEVEL_NONE; otherwise it calls executeRescueLevelInternal and logs success or failure. Here is what executeRescueLevelInternal does:
private static void executeRescueLevelInternal(Context context, int level) throws Exception {
    switch (level) {
        case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
            resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS);
            break;
        case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
            resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_CHANGES);
            break;
        case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
            resetAllSettings(context, Settings.RESET_MODE_TRUSTED_DEFAULTS);
            break;
        case LEVEL_FACTORY_RESET:
            RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
            break;
    }
}
The system is rescued differently depending on the level. There are four active levels:
1.LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS
2.LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES
3.LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS
4.LEVEL_FACTORY_RESET
Now for what each level does. The first three all call resetAllSettings, so let's look at that method first:
private static void resetAllSettings(Context context, int mode) throws Exception {
    // Try our best to reset all settings possible, and once finished
    // rethrow any exception that we encountered
    Exception res = null;
    final ContentResolver resolver = context.getContentResolver();
    try {
        Settings.Global.resetToDefaultsAsUser(resolver, null, mode, UserHandle.USER_SYSTEM);
    } catch (Throwable t) {
        res = new RuntimeException("Failed to reset global settings", t);
    }
    for (int userId : getAllUserIds()) {
        try {
            Settings.Secure.resetToDefaultsAsUser(resolver, null, mode, userId);
        } catch (Throwable t) {
            res = new RuntimeException("Failed to reset secure settings for " + userId, t);
        }
    }
    if (res != null) {
        throw res;
    }
}
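The error handling in resetAllSettings follows a "try every step, then rethrow" pattern: a failure in one reset does not stop the remaining ones, and the most recent failure is reported at the end. A minimal plain-Java sketch of that pattern, where BestEffortSketch and runAll are hypothetical names and each Runnable stands in for one reset call:

```java
import java.util.List;

// Hypothetical illustration of the best-effort pattern used by resetAllSettings().
class BestEffortSketch {
    /**
     * Runs every step even if an earlier one fails. If any step failed,
     * the most recent failure is rethrown after all steps have run.
     * Returns the number of steps that succeeded.
     */
    static int runAll(List<Runnable> steps) throws Exception {
        Exception res = null;
        int succeeded = 0;
        for (Runnable step : steps) {
            try {
                step.run();
                succeeded++;
            } catch (Throwable t) {
                // Remember the failure but keep going with the remaining steps.
                res = new RuntimeException("Step failed", t);
            }
        }
        if (res != null) {
            throw res;
        }
        return succeeded;
    }
}
```

As in the original method, only the last failure survives when several steps fail; earlier ones are overwritten.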


Depending on the level, this makes a best-effort attempt to reset every setting it can; readers who are interested can study it in more detail. The last level calls rebootPromptAndWipeUserData in the RecoverySystem class, which reboots the system into recovery mode. The detailed flow is skipped here; this call stack shows the path:
"Binder:1313_18@9485" prio=5 tid=0xbe nid=NA waiting
  java.lang.Thread.State: WAITING
  blocks Binder:1313_18@9485
  waiting for android.ui@9431 to release lock on <0x2562> (a com.android.server.power.PowerManagerService$4)
    at java.lang.Object.wait(Object.java:-1)
    at com.android.server.power.PowerManagerService.shutdownOrRebootInternal(PowerManagerService.java:2802)
    locked <0x2562> (a com.android.server.power.PowerManagerService$4)
    at com.android.server.power.PowerManagerService.-wrap35(PowerManagerService.java:-1)
    at com.android.server.power.PowerManagerService$BinderService.reboot(PowerManagerService.java:4483)
    at android.os.PowerManager.reboot(PowerManager.java:969)
    at com.android.server.RecoverySystemService$BinderService.rebootRecoveryWithCommand(RecoverySystemService.java:193)
    locked <0x25e1> (a java.lang.Object)
    at android.os.RecoverySystem.rebootRecoveryWithCommand(RecoverySystem.java:1146)
    at android.os.RecoverySystem.bootCommand(RecoverySystem.java:925)
    at android.os.RecoverySystem.rebootPromptAndWipeUserData(RecoverySystem.java:855)
    at com.android.server.RescueParty.executeRescueLevelInternal(RescueParty.java:190)
    at com.android.server.RescueParty.executeRescueLevel(RescueParty.java:166)
    at com.android.server.RescueParty.notePersistentAppCrash(RescueParty.java:126)
    at com.android.server.am.AppErrors.crashApplicationInner(AppErrors.java:343)
    at com.android.server.am.AppErrors.crashApplication(AppErrors.java:322)
    at com.android.server.am.ActivityManagerService.handleApplicationCrashInner(ActivityManagerService.java:14621)
    at com.android.server.am.ActivityManagerService.handleApplicationCrash(ActivityManagerService.java:14603)
    at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:79)
    at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3011)
    at android.os.Binder.execTransact(Binder.java:677)
The call eventually ends up in PowerManagerService's lowLevelReboot method.
III. What RescueParty monitors
The beginning of this article already listed the scenarios that trigger the mechanism:
  • a persistent app is stuck in a crash loop
  • we're stuck in a runtime restart loop.
The first case was covered in the section on how RescueParty works: it triggers when a persistent app crashes repeatedly. Now for the other case:
we're stuck in a runtime restart loop:
This monitors whether the phone is stuck restarting over and over. Here is how boot events are tracked:
private void startBootstrapServices() {
    ...
    // Now that we have the bare essentials of the OS up and running, take
    // note that we just booted, which might send out a rescue party if
    // we're stuck in a runtime restart loop.
    RescueParty.noteBoot(mSystemContext);

    // Manages LEDs and display backlight so we need it to bring up the display.
    traceBeginAndSlog("StartLightsService");
    ...
}

When system_server starts, startBootstrapServices calls noteBoot. Here is that method:
/**
 * Take note of a boot event. If we notice too many of these events
 * happening in rapid succession, we'll send out a rescue party.
 */
public static void noteBoot(Context context) {
    if (isDisabled()) return;
    if (sBoot.incrementAndTest()) {
        sBoot.reset();
        incrementRescueLevel(sBoot.uid);
        executeRescueLevel(context);
    }
}

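This should look familiar by now: it also counts events within a time window, and once the default count is reached it escalates the rescue level. The last level is rebooting into recovery.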
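To see how repeated restarts walk through the levels, the boot path can be simulated end to end. The sketch below is an illustration only: BootLoopSketch is a hypothetical class, a HashMap stands in for the system properties, the keys "boot_count", "boot_start" and "level" are made-up names rather than the real property names, and the level range 0 to 4 is assumed. The 5-events-in-300-seconds values come from the BootThreshold constructor shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of noteBoot() across repeated restarts.
class BootLoopSketch {
    static final int TRIGGER_COUNT = 5;
    static final long TRIGGER_WINDOW_MS = 300_000L;
    static final int LEVEL_FACTORY_RESET = 4; // assumed value of the last level

    // Stands in for system properties, which outlive a system_server restart.
    final Map<String, Long> props = new HashMap<>();

    private long get(String key) {
        return props.getOrDefault(key, 0L);
    }

    /** Call once per simulated boot; returns the rescue level afterwards. */
    int noteBoot(long nowMs) {
        long window = nowMs - get("boot_start");
        if (window > TRIGGER_WINDOW_MS) {
            // Window expired: this boot starts a new one.
            props.put("boot_count", 1L);
            props.put("boot_start", nowMs);
        } else {
            long count = get("boot_count") + 1;
            props.put("boot_count", count);
            if (count >= TRIGGER_COUNT) {
                // Threshold hit: reset the counter and escalate one level.
                props.put("boot_count", 0L);
                props.put("boot_start", 0L);
                props.put("level", Math.min(get("level") + 1, LEVEL_FACTORY_RESET));
            }
        }
        return (int) get("level");
    }

    public static void main(String[] args) {
        BootLoopSketch sim = new BootLoopSketch();
        for (int boot = 1; boot <= 20; boot++) {
            // One restart every 10 seconds of uptime.
            System.out.println("boot " + boot + " -> level " + sim.noteBoot(boot * 10_000L));
        }
    }
}
```

Under these assumptions each group of five quick restarts raises the level by one, so a device that never stops restarting reaches the final level after twenty restarts and stays there.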

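IV. Summary
RescueParty counts, within a time window, how often a persistent app crashes or the system restarts. If the events keep recurring, it escalates through the rescue levels according to how many times the threshold has been hit. The last level reboots into recovery mode, where the user can choose to wipe the data in order to rescue a phone that cannot otherwise recover.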