Android watchdog
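1. Watchdog overview
To keep the system highly available, Android includes a Watchdog that monitors the health of a set of critical system services. If one of them deadlocks, the Watchdog restarts SystemServer. It also accepts reboot requests from inside the system and reboots the device.
In short, Watchdog has two main jobs:
- accept internal reboot requests and reboot the system;
- monitor critical system services and restart SystemServer if one of them deadlocks.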
The monitored services are listed below. Each of them must implement the Watchdog.Monitor interface:
- ActivityManagerService
- InputManagerService
- MountService
- NativeDaemonConnector
- NetworkManagementService
- PowerManagerService
- WindowManagerService
- MediaRouterService
- MediaProjectionManagerService
2. Watchdog in detail
Watchdog is started from startOtherServices while SystemServer is booting. Watchdog is a singleton that extends Thread, so it runs as a thread inside the SystemServer process.
Watchdog initialization:
    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }
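By default the constructor registers checkers for the foreground thread, main thread, UI thread, I/O thread and display thread. It also adds a BinderThreadMonitor, whose monitor() calls the following Binder method: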
    /**
     * Call blocks until the number of executing binder threads is less
     * than the maximum number of binder threads allowed for this process.
     * @hide
     */
    public static final native void blockUntilThreadAvailable();
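Once started, the watchdog thread's run() loop checks every 30 seconds whether a monitored service has deadlocked. The check is kicked off by hc.scheduleCheckLocked(), which in turn leads to monitor() being called on each monitored object. We take ActivityManagerService as an example: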
    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }
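The service guards its critical sections with synchronized (this), so monitor() can only return once that lock is free. If monitor() still has not acquired the lock after two consecutive 30-second checks (5 seconds each in debug mode), about one minute in total, the service is considered deadlocked and the watchdog kills the SystemServer process. Because Watchdog runs inside SystemServer, myPid() in Process.killProcess(Process.myPid()) is the PID of SystemServer.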
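The lock-probing idea can be tried outside Android. The following is a minimal sketch, not AOSP code: LockProbeDemo, FakeService, probe and holdLock are hypothetical names introduced here, and the timeouts are arbitrary. It runs an empty synchronized block on a helper thread and treats a missing reply within the timeout as "possibly deadlocked".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; it only imitates the shape of Watchdog.Monitor. */
class LockProbeDemo {
    /** Same contract as described above: return only after the lock was acquired. */
    interface Monitor {
        void monitor();
    }

    /** Stand-in for a monitored service that guards its state with synchronized (this). */
    static class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) {
                // Nothing to do: reaching this point proves the lock is free.
            }
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMs. */
    static boolean probe(Monitor m, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            m.monitor();
            done.countDown();
        }, "checker");
        checker.setDaemon(true); // a stuck probe must not keep the JVM alive
        checker.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Simulates a hung service: grabs the lock on a daemon thread until the latch is released. */
    static CountDownLatch holdLock(Object lock) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread hog = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // Fall through and release the lock.
                }
            }
        }, "hog");
        hog.setDaemon(true);
        hog.start();
        try {
            held.await(); // wait until the lock is really held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return release;
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("idle service responsive: " + probe(service, 1000));
        CountDownLatch release = holdLock(service);
        System.out.println("stuck service responsive: " + probe(service, 200));
        release.countDown();
    }
}
```

A timeout here does not prove a deadlock, only that the lock was unavailable for that long, which is why the real watchdog waits through two intervals before acting.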
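After SystemServer is killed, Zygote dies as well: in com_android_internal_os_Zygote.cpp the signal handler receives SIGCHLD for SystemServer and kills the Zygote process. Finally the init process detects that zygote has died and restarts Zygote and SystemServer. (init.rc configures onrestart for the service, so it is flagged SVC_RESTARTING, and init.cpp then reaches restart_processes().)
Let us walk through this flow in the code: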
    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // 1. Schedule a check on every monitored handler thread.
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // 2. Wait for the timeout, 30 s by default.
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // 3. Read the wait state after the check. For COMPLETED, WAITING and
                //    WAITED_HALF, end this iteration and go around the loop again.
                //    For OVERDUE, fall through to the logging and kill logic below.
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // 4. OVERDUE: collect the blocked checkers, then log and kill below.
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // 5. Dump the stack traces of the listed processes via
            //    ActivityManagerService.dumpStackTraces().
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // 6. Dump kernel thread stacks.
            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // 7. Ask the kernel to log all blocked threads and the backtraces of all CPUs.
            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // 8. Write the error to the dropbox on a separate thread. AMS itself may be
            //    deadlocked, and a direct call would then hang the watchdog too.
            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            // 9. Ask the ActivityController how to proceed via systemNotResponding(subject):
            //    1 = keep waiting, -1 = kill system.
            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                // 10. Log the stack trace of every blocked checker thread.
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // 11. Kill the process.
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
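The body of evaluateCheckerCompletionLocked() is not shown above. The sketch below illustrates one way the four wait states could be derived from elapsed time. It is an assumption modeled on the behaviour described in this article, not copied from AOSP; CompletionStateSketch, Checker, scheduleCheck, complete and evaluate are hypothetical names, and time is passed in explicitly so the logic is easy to test.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of how the watchdog's four wait states could be computed. */
class CompletionStateSketch {
    // Ordered so that a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Tracks one checked thread: when its check started and whether it finished. */
    static class Checker {
        final String name;
        final long waitMaxMs;
        boolean completed = true;
        long startTimeMs;

        Checker(String name, long waitMaxMs) {
            this.name = name;
            this.waitMaxMs = waitMaxMs;
        }

        /** Starts a new check unless one is already in flight. */
        void scheduleCheck(long nowMs) {
            if (!completed) {
                return;
            }
            completed = false;
            startTimeMs = nowMs;
        }

        /** Called when the checked thread has answered. */
        void complete() {
            completed = true;
        }

        int state(long nowMs) {
            if (completed) {
                return COMPLETED;
            }
            long latency = nowMs - startTimeMs;
            if (latency < waitMaxMs / 2) {
                return WAITING;
            } else if (latency < waitMaxMs) {
                return WAITED_HALF;
            }
            return OVERDUE;
        }
    }

    /** The overall state is the worst state among all checkers. */
    static int evaluate(List<Checker> checkers, long nowMs) {
        int state = COMPLETED;
        for (Checker c : checkers) {
            state = Math.max(state, c.state(nowMs));
        }
        return state;
    }

    public static void main(String[] args) {
        Checker fg = new Checker("foreground thread", 60_000);
        List<Checker> checkers = Arrays.asList(fg);
        fg.scheduleCheck(0);
        System.out.println("after 10 s: " + evaluate(checkers, 10_000));
        System.out.println("after 30 s: " + evaluate(checkers, 30_000));
        System.out.println("after 60 s: " + evaluate(checkers, 60_000));
    }
}
```

With a 60-second maximum, a check that is still pending at the first 30-second wake-up reports WAITED_HALF and at the second reports OVERDUE, which matches the "two consecutive checks" rule described earlier.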
Step 1 above calls HandlerChecker.scheduleCheckLocked():
    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked.  This avoid having
            // to do a context switch to check the thread.  Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }
Next, look at the code of isPolling():
    /**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
    public boolean isPolling() {
        synchronized (this) {
            return isPollingLocked();
        }
    }

    private boolean isPollingLocked() {
        // If the loop is quitting then it must not be idling.
        // We can assume mPtr != 0 when mQuitting is false.
        return !mQuitting && nativeIsPolling(mPtr);
    }
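So for a checker that has no monitors, the watchdog uses MessageQueue.isPolling() to confirm that the looper is alive, and from that decides whether the thread is still active. Otherwise it posts the checker to the front of the thread's message queue and waits for it to run.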
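The "post a task to the thread and see whether it runs" half of this check can be sketched in plain Java. This is only an analogy under stated assumptions: a single-threaded ExecutorService stands in for a Handler and its Looper, the heartbeat goes to the back of the queue rather than the front, and HeartbeatSketch, newWorker and isResponsive are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: a heartbeat task shows whether a worker thread still drains its queue. */
class HeartbeatSketch {
    /** Single worker thread, marked daemon so a stuck task cannot block JVM exit. */
    static ExecutorService newWorker() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "worker");
            t.setDaemon(true);
            return t;
        });
    }

    /** Submits an empty task and reports whether the worker ran it within timeoutMs. */
    static boolean isResponsive(ExecutorService worker, long timeoutMs) {
        Future<?> beat = worker.submit(() -> { });
        try {
            beat.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            beat.cancel(false); // drop the stale heartbeat
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return true; // the task ran, so the thread is alive
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = newWorker();
        System.out.println("idle worker responsive: " + isResponsive(worker, 1000));

        // Simulate a callback that never returns until released.
        CountDownLatch release = new CountDownLatch(1);
        worker.submit(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // Interrupted: let the task end.
            }
        });
        System.out.println("blocked worker responsive: " + isResponsive(worker, 200));

        release.countDown();
        worker.shutdown();
    }
}
```

The isPolling() shortcut in the real code avoids even this round trip: a thread that is idle in its poll loop is known to be alive without running anything on it.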