ART Study Notes: Thread SuspendAll

Source: Internet. Editor: 程序博客网. Time: 2024/06/10 23:02

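Yesterday I ran into a Runtime abort caused by a Suspend All timeout during GC.

I took the opportunity to study how suspend works and how the timeout check works.

Part 1: the suspend mechanism

When a process receives signal 3 (SIGQUIT), when a GC runs, or when a debugger tries to attach, its threads are suspended. So how is suspend implemented?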

First, look at a thread dump:

"android.fg" prio=5 tid=19 Native
  | group="" sCount=1 dsCount=0 obj=0x12e5b740 self=0xb482f000
  | sysTid=590 nice=0 cgrp=default sched=0/0 handle=0xb4926d80
  | state=S schedstat=( 0 0 0 ) utm=59 stm=63 core=0 HZ=100
  | stack=0xa39f2000-0xa39f4000 stackSize=1036KB
  | held mutexes=
  native: #00 pc 000133cc  /system/lib/libc.so (syscall+28)
  native: #01 pc 000a99eb  /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82)
  native: #02 pc 0027c8a5  /system/lib/libart.so (art::GoToRunnable(art::Thread*)+756)
  native: #03 pc 00087679  /system/lib/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+8)
  native: #04 pc 000b39d5  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_MessageQueue_nativePollOnce__JI+112)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:143)
  at android.os.Looper.loop(Looper.java:122)
  at android.os.HandlerThread.run(HandlerThread.java:61)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)

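This thread is returning from a JNI method call. When it tries to switch from the Native state back to Runnable, it finds that its suspend flag is set, so it blocks in a condition wait until it is woken up.

ART threads use two functions, TransitionFromSuspendedToRunnable and TransitionFromRunnableToSuspended, to move between Runnable and the suspended (or any other non-runnable) states.

The thread states are as follows: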

enum ThreadState {
  //                                   Thread.State   JDWP state
  kTerminated = 66,                 // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                        // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                    // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                        // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                         // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                         // WAITING        TS_WAIT      in Object.wait()
  kWaitingForGcToComplete,          // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,      // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,             // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,          // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,      // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,       // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,    // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,             // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,   // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop,  // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,        // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,    // WAITING        TS_WAIT      waiting for method tracing to start
  kStarting,                        // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                          // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                       // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};

To suspend a thread, the requester first increments that thread's suspend count and sets its kSuspendRequest flag.
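To make the count-plus-flag bookkeeping concrete, here is a minimal sketch. It is not ART code: `ToySuspendState`, `AdjustSuspendCount` and the flag value are names I made up for illustration, and a plain mutex stands in for thread_suspend_count_lock_.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical flag bit; only the name mirrors kSuspendRequest, the value is made up.
constexpr uint32_t kToySuspendRequest = 1u;

struct ToySuspendState {
  std::mutex suspend_count_lock;  // stands in for thread_suspend_count_lock_
  int suspend_count = 0;
  uint32_t flags = 0;

  // delta is +1 to request a suspension and -1 to withdraw one.
  void AdjustSuspendCount(int delta) {
    std::lock_guard<std::mutex> guard(suspend_count_lock);
    assert(suspend_count + delta >= 0);
    suspend_count += delta;
    // The flag is set exactly while at least one request is outstanding.
    if (suspend_count > 0) {
      flags |= kToySuspendRequest;
    } else {
      flags &= ~kToySuspendRequest;
    }
  }

  bool SuspendRequested() {
    std::lock_guard<std::mutex> guard(suspend_count_lock);
    return (flags & kToySuspendRequest) != 0;
  }
};
```

The point of keeping a count next to the flag is that several requesters (say GC and a debugger) can overlap: the flag is cleared only when the last of them withdraws its request.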

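After that, the thread blocks the next time it runs CheckSuspend. This happens in many places (apparently at almost every jump; I have not yet worked out how the interpreter executes):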

static inline void CheckSuspend(Thread* thread) {
  for (;;) {
    if (thread->ReadFlag(kCheckpointRequest)) {
      thread->RunCheckpointFunction();
    } else if (thread->ReadFlag(kSuspendRequest)) {
      thread->FullSuspendCheck();
    } else {
      break;
    }
  }
}

void Thread::FullSuspendCheck() {
  VLOG(threads) << this << " self-suspending";
  ATRACE_BEGIN("Full suspend check");
  // Make thread appear suspended to other threads, release mutator_lock_.
  TransitionFromRunnableToSuspended(kSuspended);
  // Transition back to runnable noting requests to suspend, re-acquire share on mutator_lock_.
  TransitionFromSuspendedToRunnable();
  ATRACE_END();
  VLOG(threads) << this << " self-reviving";
}

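Alternatively, when a JNI function returns and the thread tries to switch from Native to Runnable, it calls TransitionFromSuspendedToRunnable.

When a JNI function is entered and the thread switches from Runnable to Native, it calls TransitionFromRunnableToSuspended.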

extern void JniMethodEnd(uint32_t saved_local_ref_cookie, Thread* self) {
  GoToRunnable(self);
  PopLocalReferences(saved_local_ref_cookie, self);
}

static void GoToRunnable(Thread* self) NO_THREAD_SAFETY_ANALYSIS {
  mirror::ArtMethod* native_method = self->GetManagedStack()->GetTopQuickFrame()->AsMirrorPtr();
  bool is_fast = native_method->IsFastNative();
  if (!is_fast) {
    self->TransitionFromSuspendedToRunnable();
  } else if (UNLIKELY(self->TestAllFlags())) {
    // In fast JNI mode we never transitioned out of runnable. Perform a suspend check if there
    // is a flag raised.
    DCHECK(Locks::mutator_lock_->IsSharedHeld(self));
    CheckSuspend(self);
  }
}

Now let's look at how these two functions are implemented:

inline void Thread::TransitionFromRunnableToSuspended(ThreadState new_state) {
  AssertThreadSuspensionIsAllowable();
  DCHECK_NE(new_state, kRunnable);
  DCHECK_EQ(this, Thread::Current());
  // Change to non-runnable state, thereby appearing suspended to the system.
  DCHECK_EQ(GetState(), kRunnable);
  union StateAndFlags old_state_and_flags;
  union StateAndFlags new_state_and_flags;
  while (true) {
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    if (UNLIKELY((old_state_and_flags.as_struct.flags & kCheckpointRequest) != 0)) {
      RunCheckpointFunction();
      continue;
    }
    // Change the state but keep the current flags (kCheckpointRequest is clear).
    DCHECK_EQ((old_state_and_flags.as_struct.flags & kCheckpointRequest), 0);
    new_state_and_flags.as_struct.flags = old_state_and_flags.as_struct.flags;
    new_state_and_flags.as_struct.state = new_state;
    // CAS the value without a memory ordering as that is given by the lock release below.
    bool done =
        tls32_.state_and_flags.as_atomic_int.CompareExchangeWeakRelaxed(old_state_and_flags.as_int,
                                                                        new_state_and_flags.as_int);
    if (LIKELY(done)) {
      break;
    }
  }
  // Release share on mutator_lock_.
  Locks::mutator_lock_->SharedUnlock(this);
}

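As you can see, switching to a suspended state is fairly simple: set the new state, then release the shared hold on mutator_lock_. Releasing the lock matters a great deal; it is the core of the timeout detection mechanism, which is covered in detail later.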
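The CAS loop in that function can be hard to read with all the union accesses, so here is a stripped-down sketch of the same idea using std::atomic. The 16/16 bit layout and the names `Pack`, `StateOf`, `FlagsOf` and `SetStateKeepFlags` are my own assumptions for illustration; they are not ART's definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy layout (an assumption): low 16 bits hold the flags, high 16 bits hold the state.
inline uint32_t Pack(uint16_t state, uint16_t flags) {
  return (static_cast<uint32_t>(state) << 16) | flags;
}
inline uint16_t StateOf(uint32_t word) { return static_cast<uint16_t>(word >> 16); }
inline uint16_t FlagsOf(uint32_t word) { return static_cast<uint16_t>(word & 0xFFFFu); }

// Replace the state half of the word while preserving whatever flags are set.
// If another thread changes the flags between our read and our write, the
// compare-exchange fails, old_word is refreshed, and we retry.
inline void SetStateKeepFlags(std::atomic<uint32_t>& word, uint16_t new_state) {
  uint32_t old_word = word.load(std::memory_order_relaxed);
  uint32_t new_word;
  do {
    new_word = Pack(new_state, FlagsOf(old_word));
  } while (!word.compare_exchange_weak(old_word, new_word, std::memory_order_relaxed));
}
```

Packing state and flags into one word is what lets a single CAS notice a flag that was raised concurrently, instead of overwriting it.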

Switching from a suspended state back to Runnable:

inline ThreadState Thread::TransitionFromSuspendedToRunnable() {
  bool done = false;
  union StateAndFlags old_state_and_flags;
  old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
  int16_t old_state = old_state_and_flags.as_struct.state;
  DCHECK_NE(static_cast<ThreadState>(old_state), kRunnable);
  do {
    Locks::mutator_lock_->AssertNotHeld(this);  // Otherwise we starve GC..
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
    if (UNLIKELY((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0)) {
      // Wait while our suspend count is non-zero.
      MutexLock mu(this, *Locks::thread_suspend_count_lock_);
      old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
      DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
      // (Note) Until the thread is resumed, kSuspendRequest stays set, so the thread
      // waits on the condition variable here, exactly as in the dump shown earlier.
      while ((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0) {
        // Re-check when Thread::resume_cond_ is notified.
        Thread::resume_cond_->Wait(this);
        old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
        DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
      }
      DCHECK_EQ(GetSuspendCount(), 0);
    }
    // Re-acquire shared mutator_lock_ access.
    Locks::mutator_lock_->SharedLock(this);
    // Atomically change from suspended to runnable if no suspend request pending.
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
    if (LIKELY((old_state_and_flags.as_struct.flags & kSuspendRequest) == 0)) {
      union StateAndFlags new_state_and_flags;
      new_state_and_flags.as_int = old_state_and_flags.as_int;
      new_state_and_flags.as_struct.state = kRunnable;
      // CAS the value without a memory ordering as that is given by the lock acquisition above.
      done =
          tls32_.state_and_flags.as_atomic_int.CompareExchangeWeakRelaxed(old_state_and_flags.as_int,
                                                                          new_state_and_flags.as_int);
    }
    if (UNLIKELY(!done)) {
      // Failed to transition to Runnable. Release shared mutator_lock_ access and try again.
      Locks::mutator_lock_->SharedUnlock(this);
    } else {
      return static_cast<ThreadState>(old_state);
    }
  } while (true);
}

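The function first checks kSuspendRequest. While the flag is set, the thread waits on the condition variable until it is woken. Once woken, it takes mutator_lock_ (shared) and changes its state; note that the state change is done with an atomic compare-and-swap, and the whole sequence is retried if a new suspend request arrives in between.

So the core of suspend is this: while the kSuspendRequest flag is set, the thread sits in a condition wait until it is resumed, and that is what actually stops it.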
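The flag-plus-condition-wait pattern can be sketched in standard C++ as follows. This is only a model under my own assumptions: `ToySuspendGate` and its methods are hypothetical names, one gate serves one thread, and a single bool replaces the suspend count.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

class ToySuspendGate {
 public:
  // Called by the requester (the GC, for example) to ask the thread to stop.
  void RequestSuspend() {
    std::lock_guard<std::mutex> guard(mu_);
    suspend_requested_ = true;
  }

  // Called by the requester to let the thread continue.
  void Resume() {
    {
      std::lock_guard<std::mutex> guard(mu_);
      assert(suspend_requested_);
      suspend_requested_ = false;
    }
    resume_cond_.notify_all();
  }

  // Called by the thread itself at a check point. Returns immediately if no
  // request is pending, otherwise blocks until Resume() clears the flag.
  void WaitWhileSuspendRequested() {
    std::unique_lock<std::mutex> lock(mu_);
    resume_cond_.wait(lock, [this] { return !suspend_requested_; });
  }

  bool IsSuspendRequested() {
    std::lock_guard<std::mutex> guard(mu_);
    return suspend_requested_;
  }

 private:
  std::mutex mu_;
  std::condition_variable resume_cond_;
  bool suspend_requested_ = false;
};
```

The predicate form of `wait` re-checks the flag after every wakeup, which plays the same role as the `while` loop around `Thread::resume_cond_->Wait(this)` in the ART code.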

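That raises a question: how does the requester know that all threads have finished suspending, so that it can go ahead with its own work?

Part 2: the suspend timeout check

As mentioned above, entering the suspended state and entering the Runnable state go hand in hand with unlocking and locking mutator_lock_. A thread must hold mutator_lock_ (shared) while it runs, and many functions are annotated as requiring its protection. So the state of mutator_lock_ tells us whether any thread is still Runnable. ART implements it like this: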

  // Block on the mutator lock until all Runnable threads release their share of access.
#if HAVE_TIMED_RWLOCK
  // Timeout if we wait more than 30 seconds.
  if (!Locks::mutator_lock_->ExclusiveLockWithTimeout(self, 30 * 1000, 0)) {
    UnsafeLogFatalForThreadSuspendAllTimeout();
  }
#else
  Locks::mutator_lock_->ExclusiveLock(self);
#endif

It tries to acquire mutator_lock_ exclusively with a timeout of 30 s. Underneath, this is a futex wait, which I won't go into here.
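The same "exclusive lock with timeout" idea can be tried out with std::shared_timed_mutex from the standard library. This is a sketch under assumptions, not ART's implementation: `SuspendAllSucceeds` is a hypothetical helper, one extra thread plays the Runnable mutator, and the durations are shrunk from 30 s to milliseconds.

```cpp
#include <chrono>
#include <future>
#include <shared_mutex>
#include <thread>

// One "runnable" thread keeps its share of the lock for `hold`; the caller
// then waits at most `timeout` for exclusive access.
// Returns true if exclusive access was obtained in time.
inline bool SuspendAllSucceeds(std::chrono::milliseconds hold,
                               std::chrono::milliseconds timeout) {
  std::shared_timed_mutex mutator_lock;  // stands in for mutator_lock_
  std::promise<void> holding;

  std::thread runnable([&] {
    mutator_lock.lock_shared();       // Runnable: holds a share
    holding.set_value();
    std::this_thread::sleep_for(hold);
    mutator_lock.unlock_shared();     // leaves Runnable: releases the share
  });

  holding.get_future().wait();        // make sure the share is held first
  bool acquired = mutator_lock.try_lock_for(timeout);
  if (acquired) {
    mutator_lock.unlock();
  }
  runnable.join();
  return acquired;
}
```

Calling it with a short hold and a generous timeout models the normal case; a hold longer than the timeout models the abort case, where ART would call UnsafeLogFatalForThreadSuspendAllTimeout().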

In practice, then, any thread that shows up in the dump log with state Runnable is the cause of the timeout.
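When the dump is long, a few lines of code can pick out those threads. The sketch below assumes that every thread header looks like the one in the dump at the top of this post (quoted name, then attributes, then `tid=N`, then the state); `FindRunnableThreads` is a name I made up, and other dump formats may need different parsing.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns the names of all threads whose header line reports state "Runnable".
inline std::vector<std::string> FindRunnableThreads(const std::string& dump) {
  std::vector<std::string> names;
  std::istringstream in(dump);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] != '"') continue;   // not a thread header line
    std::size_t close = line.find('"', 1);          // end of the quoted name
    if (close == std::string::npos) continue;
    std::size_t tid = line.find(" tid=", close);
    if (tid == std::string::npos) continue;
    std::size_t space = line.find(' ', tid + 1);    // space after "tid=N"
    if (space == std::string::npos) continue;
    std::size_t state_begin = space + 1;
    assert(state_begin <= line.size());             // precondition of substr
    std::size_t state_end = line.find(' ', state_begin);
    std::string state = (state_end == std::string::npos)
                            ? line.substr(state_begin)
                            : line.substr(state_begin, state_end - state_begin);
    if (state == "Runnable") {
      names.push_back(line.substr(1, close - 1));
    }
  }
  return names;
}
```

Feeding it the text of a dump returns the thread names to look at first; their stacks then show what kept them from reaching a suspend check.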

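As a side note, here is why threads that are waiting (Blocked) on synchronized do not cause a timeout:

synchronized is implemented by Monitor. This is similar to the implementation in Dalvik (dvm): both are based on thin locks and fat locks.

When a thread enters synchronized and has to wait for the current lock holder to finish, it switches its own state from Runnable to kBlocked. It does so by creating a ScopedThreadStateChange object, whose constructor calls the transition functions shown above: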

class ScopedThreadStateChange {
 public:
  ScopedThreadStateChange(Thread* self, ThreadState new_thread_state)
      LOCKS_EXCLUDED(Locks::thread_suspend_count_lock_) ALWAYS_INLINE
      : self_(self), thread_state_(new_thread_state), expected_has_no_thread_(false) {
    if (UNLIKELY(self_ == NULL)) {
      // Value chosen arbitrarily and won't be used in the destructor since thread_ == NULL.
      old_thread_state_ = kTerminated;
      Runtime* runtime = Runtime::Current();
      CHECK(runtime == NULL || !runtime->IsStarted() || runtime->IsShuttingDown(self_));
    } else {
      DCHECK_EQ(self, Thread::Current());
      // Read state without locks, ok as state is effectively thread local and we're not interested
      // in the suspend count (this will be handled in the runnable transitions).
      old_thread_state_ = self->GetState();
      if (old_thread_state_ != new_thread_state) {
        if (new_thread_state == kRunnable) {
          self_->TransitionFromSuspendedToRunnable();
        } else if (old_thread_state_ == kRunnable) {
          self_->TransitionFromRunnableToSuspended(new_thread_state);
        } else {
          // A suspended transition to another effectively suspended transition, ok to use Unsafe.
          self_->SetState(new_thread_state);
        }
      }
    }
  }

So a thread that is blocked has also released mutator_lock_.
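The RAII shape of that idea, reduced to a single-threaded toy, looks roughly like this. Everything here is hypothetical (`ToyMutatorLock`, `ToyMutator`, `ScopedBlocked`), and the "lock" is only a counter of shared holders, so it models the bookkeeping rather than real synchronization.

```cpp
#include <cassert>

enum class ToyState { kRunnable, kBlocked };

// Toy stand-in for mutator_lock_: it only counts shared holders.
struct ToyMutatorLock {
  int shared_holders = 0;
  bool CanLockExclusively() const { return shared_holders == 0; }
};

// A toy thread starts out Runnable and therefore holds one share.
struct ToyMutator {
  ToyMutatorLock* lock;
  ToyState state;
  explicit ToyMutator(ToyMutatorLock* l) : lock(l), state(ToyState::kRunnable) {
    ++lock->shared_holders;
  }
};

// While an object of this class is alive, the thread is Blocked and its share
// of the lock is released; the destructor restores both.
class ScopedBlocked {
 public:
  explicit ScopedBlocked(ToyMutator* self) : self_(self) {
    assert(self_->state == ToyState::kRunnable);
    self_->state = ToyState::kBlocked;
    --self_->lock->shared_holders;
  }
  ~ScopedBlocked() {
    ++self_->lock->shared_holders;
    self_->state = ToyState::kRunnable;
  }
  ScopedBlocked(const ScopedBlocked&) = delete;
  ScopedBlocked& operator=(const ScopedBlocked&) = delete;

 private:
  ToyMutator* self_;
};
```

Inside the scope the exclusive lock is obtainable even though the thread has not finished its work, which is why a thread stuck on a monitor never holds up SuspendAll.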

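As I see it for now, holding mutator_lock_ is the marker of whether a thread is in an executable state.

To summarize: suspend works by setting kSuspendRequest. Threads check the flag at certain points during execution, and when they see it they switch from their current state to suspended and unlock mutator_lock_.

The suspend check tries to acquire mutator_lock_ within 30 s. If that times out, some thread has still not finished its current work, so the runtime dumps the stacks and aborts.

That's it for now. These are notes to self, and some of my understanding is surely wrong. I am especially unsure about my guess as to where CheckSuspend gets executed; I'll look again when I hit the next problem.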

0 0