Android watchdog

Source: Internet  Editor: 程序博客网  Date: 2024/05/19 19:14

1. Watchdog overview

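To keep the system highly available, Android includes a Watchdog that monitors the health of several key system services. If a key service deadlocks, the Watchdog restarts SystemServer. It also receives internal reboot requests and reboots the system.

In short, the Watchdog has two main functions:

  1. Receive internal reboot requests and reboot the system;
  2. Monitor key system services and restart SystemServer if one of them deadlocks.
    The monitored key services, each of which must implement the Watchdog.Monitor interface: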
    ActivityManagerService
    InputManagerService
    MountService
    NativeDaemonConnector
    NetworkManagementService
    PowerManagerService
    WindowManagerService
    MediaRouterService
    MediaProjectionManagerService
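
To make the Monitor contract concrete, here is a minimal plain-Java sketch rather than the AOSP source. MonitorRegistrySketch, FakeService, runAll() and probeCount() are hypothetical names introduced for illustration; the only idea taken from the text is that each monitored service exposes a monitor() callback that the watchdog calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the Monitor contract; all names are hypothetical. */
class MonitorRegistrySketch {
    /** Same shape as Watchdog.Monitor: a single callback. */
    interface Monitor {
        void monitor();
    }

    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static class FakeService implements Monitor {
        private int probes = 0;

        @Override
        public void monitor() {
            // Returns only once the service lock can be acquired.
            // The counter exists only so the demo can observe the call.
            synchronized (this) {
                probes++;
            }
        }

        synchronized int probeCount() {
            return probes;
        }
    }

    private final List<Monitor> monitors = new ArrayList<>();

    synchronized void addMonitor(Monitor m) {
        monitors.add(m);
    }

    /** Calls every registered monitor in turn and returns how many returned. */
    synchronized int runAll() {
        int returned = 0;
        for (Monitor m : monitors) {
            m.monitor();
            returned++;
        }
        return returned;
    }

    public static void main(String[] args) {
        MonitorRegistrySketch registry = new MonitorRegistrySketch();
        registry.addMonitor(new FakeService());
        registry.addMonitor(new FakeService());
        System.out.println("monitors that returned: " + registry.runAll());
    }
}
```

If any monitor() call blocked forever, runAll() would never return, which is exactly the condition the real Watchdog detects from another thread.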

2. Watchdog in detail

Watchdog in one diagram


The Watchdog is started from startOtherServices during SystemServer startup. It is created as a singleton and extends Thread, so it actually runs as a thread inside the SystemServer process.

Watchdog initialization:

    private Watchdog() {
        super("watchdog");
        // Initialize handler checkers for each common thread we want to check.  Note
        // that we are not currently checking the background thread, since it can
        // potentially hold longer running operations with no guarantees about the timeliness
        // of operations there.

        // The shared foreground thread is the main checker.  It is where we
        // will also dispatch monitor checks and do other work.
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }


By default, the constructor adds checkers for the foreground thread, main thread, UI thread, I/O thread, and display thread.
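
The per-thread checker idea can be sketched outside Android. The example below is an analogy, not the HandlerChecker implementation: single-thread ExecutorService instances stand in for Handler threads, and ThreadCheckerSketch, addChecker(), findBlocked() and demo() are hypothetical names. It posts a no-op probe to each thread and reports the threads whose probe did not run before the timeout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Analogy for per-thread checkers; executors stand in for Handler threads. */
class ThreadCheckerSketch {
    private final Map<String, ExecutorService> threads = new LinkedHashMap<>();

    void addChecker(String name, ExecutorService thread) {
        threads.put(name, thread);
    }

    /** Posts a no-op probe to every thread; returns names whose probe did not run in time. */
    List<String> findBlocked(long timeoutMs) {
        Map<String, Future<?>> probes = new LinkedHashMap<>();
        for (Map.Entry<String, ExecutorService> e : threads.entrySet()) {
            probes.put(e.getKey(), e.getValue().submit(() -> { }));
        }
        List<String> blocked = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        for (Map.Entry<String, Future<?>> e : probes.entrySet()) {
            long remaining = Math.max(deadline - System.nanoTime(), 0);
            try {
                e.getValue().get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException te) {
                blocked.add(e.getKey());
            } catch (ExecutionException ee) {
                // A no-op probe cannot fail; treat the thread as responsive.
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                blocked.add(e.getKey());
            }
        }
        return blocked;
    }

    /** Wedges one of two threads and returns what the checker reports. */
    static List<String> demo() {
        ExecutorService fg = Executors.newSingleThreadExecutor();
        ExecutorService io = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy the "i/o thread" with a task that does not finish until released.
            io.submit(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            ThreadCheckerSketch checker = new ThreadCheckerSketch();
            checker.addChecker("foreground thread", fg);
            checker.addChecker("i/o thread", io);
            return checker.findBlocked(200);
        } finally {
            release.countDown();
            fg.shutdownNow();
            io.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked: " + demo());
    }
}
```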

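The BinderThreadMonitor registered at the end of the constructor calls Binder.blockUntilThreadAvailable() from its monitor(). That method is declared in Binder.java:

    /**
     * Call blocks until the number of executing binder threads is less
     * than the maximum number of binder threads allowed for this process.
     * @hide
     */
    public static final native void blockUntilThreadAvailable();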
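
Since blockUntilThreadAvailable() is native, its body is not visible here. The following is a plain-Java sketch of the behavior its doc comment describes, under the assumption that it amounts to waiting on a counter of executing threads. BinderPoolSketch and all of its methods are hypothetical; the timed variant exists only so the demo can return instead of blocking.

```java
/** Sketch of "block until a pool thread is free"; all names are hypothetical. */
class BinderPoolSketch {
    private final int maxThreads;
    private int executing = 0;

    BinderPoolSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    synchronized void onCallStarted() {
        executing++;
    }

    synchronized void onCallFinished() {
        executing--;
        notifyAll(); // wake anyone waiting for a free thread
    }

    /** Blocks until fewer than maxThreads calls are executing. */
    synchronized void blockUntilThreadAvailable() throws InterruptedException {
        while (executing >= maxThreads) {
            wait();
        }
    }

    /** Timed variant: returns false if the pool is still saturated after timeoutMs. */
    synchronized boolean awaitThreadAvailable(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        try {
            while (executing >= maxThreads) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                wait(remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        BinderPoolSketch pool = new BinderPoolSketch(2);
        pool.onCallStarted();
        pool.onCallStarted();
        System.out.println("free while saturated: " + pool.awaitThreadAvailable(50));
        pool.onCallFinished();
        System.out.println("free after one call ends: " + pool.awaitThreadAvailable(50));
    }
}
```

A monitor built on the untimed method never returns while every pool thread is busy, so a saturated pool shows up to the watchdog the same way a held lock does.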


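Once started, the Watchdog's run() loop checks every 30s whether a monitored service has deadlocked. Each check is scheduled through hc.scheduleCheckLocked(), which in turn calls monitor() on every monitored object to verify it. Take ActivityManagerService as an example: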

    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    public void monitor() {
        synchronized (this) { }
    }

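Because the service's critical sections are all guarded by synchronized (this), monitor() can return only when that lock is free. If monitor() still has not acquired the lock after two consecutive check intervals of 30s each (5s each in debug mode), the process is considered deadlocked and the Watchdog kills SystemServer. (The Watchdog runs in the SystemServer process, so myPid() in Process.killProcess(Process.myPid()) is the SystemServer PID.)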
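
The lock probe can be tried in plain Java. This is a sketch under simplifying assumptions, not the Watchdog code: a probe thread calls monitor() and the caller waits on a latch with a short timeout, where the real Watchdog posts the check to a handler thread and evaluates it on its 30s cycle. LockProbeSketch, Service, probe() and probeWhileLockHeld() are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch of detecting a stuck lock with a monitor() probe; names are hypothetical. */
class LockProbeSketch {
    /** Stand-in for a monitored service. */
    static class Service {
        void monitor() {
            synchronized (this) { } // returns only when the lock is free
        }
    }

    /** Runs service.monitor() on a probe thread; true if it returned within timeoutMs. */
    static boolean probe(Service service, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Holds the service lock on another thread while probing, simulating a stuck service. */
    static boolean probeWhileLockHeld(Service service, long timeoutMs) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await(); // make sure the lock is really taken before probing
            return probe(service, timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("free lock, probe returned: " + probe(new Service(), 1000));
        System.out.println("held lock, probe returned: "
                + probeWhileLockHeld(new Service(), 100));
    }
}
```

A probe that times out does not prove a deadlock, only that the lock was unavailable for that long, which is why the Watchdog waits through two intervals before acting.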

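Once SystemServer is killed, Zygote dies as well: com_android_internal_os_Zygote.cpp handles SIGCHLD and kills the Zygote process when the exited child is SystemServer. Finally, init detects that zygote has died and restarts Zygote and SystemServer (zygote's entry in init.rc configures onrestart, the service gets the SVC_RESTARTING flag, and init.cpp restarts it in restart_processes()).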

Now let's walk through this flow in the code:

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    // 1. Schedule a check for every monitored thread/service
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        // 2. Wait for the timeout (CHECK_INTERVAL, 30s by default)
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }

                // 3. Read waitState after the wait. COMPLETED, WAITING and WAITED_HALF end
                // this iteration and go around the loop again; OVERDUE falls through to
                // the overdue handling below (log, then kill the process).
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }

                // 4. OVERDUE: run the overdue handling (log, then kill the process).
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // 5. Dump stack traces through ActivityManagerService.dumpStackTraces()
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // 6. Dump kernel stack traces
            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // 7. Have the kernel log all blocked threads and the backtraces of all CPUs
            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // 8. Write the error to the dropbox on a separate thread with a 2s join,
            // because AMS itself may be deadlocked and would otherwise block the watchdog
            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            // 9. Ask the ActivityController how to handle systemNotResponding(subject):
            // 1 = keep waiting, -1 = kill system
            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }

            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                // 10. Print the stack traces of the blocked checkers
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                // 11. Kill the process
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }
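
The listing calls evaluateCheckerCompletionLocked() without showing it. The sketch below is one plausible reading, assuming each checker's state depends only on whether its probe has run and on how long it has been pending relative to its timeout, and that the overall state is the worst one among all checkers. CompletionStateSketch, stateOf() and worstOf() are hypothetical names, and the numeric values of the four constants are assumptions.

```java
/** Sketch of deriving the wait state from elapsed time; names and values are assumptions. */
class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** State of one checker given whether its probe ran and how long it has been pending. */
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;      // less than half the timeout has passed
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;  // at least half, but not the full timeout
        }
        return OVERDUE;          // the full timeout has passed
    }

    /** The overall state is the worst (largest) state among all checkers. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long timeoutMs = 60_000; // two 30s check intervals
        System.out.println(stateOf(false, 10_000, timeoutMs));
        System.out.println(stateOf(false, 30_000, timeoutMs));
        System.out.println(stateOf(false, 60_000, timeoutMs));
    }
}
```

With a 60s timeout and a 30s check interval, a stuck checker is seen as WAITED_HALF on the first wake-up and OVERDUE on the second, which matches the "two consecutive checks" rule described earlier.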

Step 1 above calls HandlerChecker.scheduleCheckLocked():

    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // If the target looper has recently been polling, then
            // there is no reason to enqueue our checker on it since that
            // is as good as it not being deadlocked.  This avoid having
            // to do a context switch to check the thread.  Note that we
            // only do this if mCheckReboot is false and we have no
            // monitors, since those would need to be executed at this point.
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // we already have a check in flight, so no need
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        mHandler.postAtFrontOfQueue(this);
    }

Next, look at the isPolling() code in MessageQueue:
    /**
     * Returns whether this looper's thread is currently polling for more work to do.
     * This is a good signal that the loop is still alive rather than being stuck
     * handling a callback.  Note that this method is intrinsically racy, since the
     * state of the loop can change before you get the result back.
     *
     * <p>This method is safe to call from any thread.
     *
     * @return True if the looper is currently polling for events.
     * @hide
     */
    public boolean isPolling() {
        synchronized (this) {
            return isPollingLocked();
        }
    }

    private boolean isPollingLocked() {
        // If the loop is quitting then it must not be idling.
        // We can assume mPtr != 0 when mQuitting is false.
        return !mQuitting && nativeIsPolling(mPtr);
    }


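In the end, MessageQueue's isPolling() method reports whether the looper is still alive, and that is what tells the Watchdog whether the thread is active.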
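
The polling fast path can be imitated in plain Java. This is an analogy only: a BlockingQueue loop stands in for Looper and MessageQueue, a volatile flag stands in for nativeIsPolling(), and PollingFastPathSketch, scheduleCheck() and demo() are hypothetical names. As with the real isPolling(), the flag is inherently racy.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

/** Analogy for the isPolling() fast path; all names are hypothetical. */
class PollingFastPathSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean polling = false;  // true while the loop waits for work
    private volatile boolean completed = true; // false while a probe is in flight
    private final Thread worker;

    PollingFastPathSketch() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    polling = true;
                    Runnable task = queue.take(); // idle here when there is no work
                    polling = false;
                    task.run();
                }
            } catch (InterruptedException e) {
                // loop ends
            }
        }, "looper");
        worker.setDaemon(true);
        worker.start();
    }

    /** Returns true if the fast path was taken, that is, no probe was posted. */
    boolean scheduleCheck() {
        if (polling) {
            completed = true; // an idle loop is as good as a finished probe
            return true;
        }
        if (!completed) {
            return false;     // a probe is already in flight
        }
        completed = false;
        queue.add(() -> completed = true);
        return false;
    }

    /** Returns { fast path when idle, fast path when busy, probe pending while busy }. */
    static boolean[] demo() {
        PollingFastPathSketch s = new PollingFastPathSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            while (!s.polling) Thread.sleep(1);       // wait until the loop is idle
            boolean idleFastPath = s.scheduleCheck();
            s.queue.add(() -> {                       // keep the loop busy
                started.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            started.await();
            boolean busyFastPath = s.scheduleCheck(); // must post a probe instead
            boolean pendingWhileBusy = !s.completed;
            release.countDown();
            while (!s.completed) Thread.sleep(1);     // probe runs once the loop is free
            return new boolean[] { idleFastPath, busyFastPath, pendingWhileBusy };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new boolean[0];
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("idle fast path: " + r[0] + ", busy fast path: " + r[1]);
    }
}
```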