ART Study Notes: Thread SuspendAll
Source: Internet  Editor: 程序博客网  Time: 2024/06/10 23:02
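Yesterday I ran into a Runtime abort caused by a Suspend All timeout during GC.
While I was at it, I looked into how suspend works and how the timeout check works.
Part 1: the suspend mechanism
When a process receives signal 3 (SIGQUIT), runs a GC, or has a debugger trying to attach, its threads get suspended. So how is suspend implemented?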
First, look at a thread dump:
"android.fg" prio=5 tid=19 Native
  | group="" sCount=1 dsCount=0 obj=0x12e5b740 self=0xb482f000
  | sysTid=590 nice=0 cgrp=default sched=0/0 handle=0xb4926d80
  | state=S schedstat=( 0 0 0 ) utm=59 stm=63 core=0 HZ=100
  | stack=0xa39f2000-0xa39f4000 stackSize=1036KB
  | held mutexes=
  native: #00 pc 000133cc  /system/lib/libc.so (syscall+28)
  native: #01 pc 000a99eb  /system/lib/libart.so (art::ConditionVariable::Wait(art::Thread*)+82)
  native: #02 pc 0027c8a5  /system/lib/libart.so (art::GoToRunnable(art::Thread*)+756)
  native: #03 pc 00087679  /system/lib/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+8)
  native: #04 pc 000b39d5  /data/dalvik-cache/arm/system@framework@boot.oat (Java_android_os_MessageQueue_nativePollOnce__JI+112)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:143)
  at android.os.Looper.loop(Looper.java:122)
  at android.os.HandlerThread.run(HandlerThread.java:61)
  at com.android.server.ServiceThread.run(ServiceThread.java:46)
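This thread was returning from a JNI method call. When it tried to switch from the Native state back to Runnable, it found its suspend flag set, so it entered a condition wait until woken up.
ART threads use two functions, TransitionFromSuspendedToRunnable and TransitionFromRunnableToSuspended, to move between Runnable and the suspended (or any other non-runnable) states.
The thread states are the following: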
enum ThreadState {
  //                                   Thread.State   JDWP state
  kTerminated = 66,                 // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                        // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                    // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                        // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                         // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                         // WAITING        TS_WAIT      in Object.wait()
  kWaitingForGcToComplete,          // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,      // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,             // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,          // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,      // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,       // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,    // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,             // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,   // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop,  // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,        // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,    // WAITING        TS_WAIT      waiting for method tracing to start
  kStarting,                        // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                          // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                       // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};
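To suspend a thread, the first step is to increment its suspend count and set the kSuspendRequest flag.
After that, in many situations (at almost every branch; I have not yet worked out how the interpreter executes) the thread blocks when it reaches CheckSuspend: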
static inline void CheckSuspend(Thread* thread) {
  for (;;) {
    if (thread->ReadFlag(kCheckpointRequest)) {
      thread->RunCheckpointFunction();
    } else if (thread->ReadFlag(kSuspendRequest)) {
      thread->FullSuspendCheck();
    } else {
      break;
    }
  }
}

void Thread::FullSuspendCheck() {
  VLOG(threads) << this << " self-suspending";
  ATRACE_BEGIN("Full suspend check");
  // Make thread appear suspended to other threads, release mutator_lock_.
  TransitionFromRunnableToSuspended(kSuspended);
  // Transition back to runnable noting requests to suspend, re-acquire share on mutator_lock_.
  TransitionFromSuspendedToRunnable();
  ATRACE_END();
  VLOG(threads) << this << " self-reviving";
}
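To make the two sides concrete, here is a minimal single-threaded sketch of the request side (bump the count, set the flag) and the polling side (decide what to do at a check point). It is not ART code: SketchThread, SketchAction and the bit values are hypothetical names chosen for illustration, and the real runtime guards the suspend count with a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical flag bits, named after the ART flags discussed in these notes.
constexpr uint32_t kSketchSuspendRequest = 1u << 0;
constexpr uint32_t kSketchCheckpointRequest = 1u << 1;

// What a thread would do the next time it polls its flags.
enum class SketchAction { kContinue, kRunCheckpoint, kSelfSuspend };

struct SketchThread {
  std::atomic<uint32_t> flags{0};
  int suspend_count = 0;  // assumption: the real runtime protects this with a lock

  // Requester side: increment the count and raise the flag.
  void RequestSuspend() {
    ++suspend_count;
    flags.fetch_or(kSketchSuspendRequest);
  }

  // Requester side: decrement the count; clear the flag only when it hits zero.
  void Resume() {
    if (suspend_count > 0 && --suspend_count == 0) {
      flags.fetch_and(~kSketchSuspendRequest);
    }
  }

  void RequestCheckpoint() { flags.fetch_or(kSketchCheckpointRequest); }
  void ClearCheckpoint() { flags.fetch_and(~kSketchCheckpointRequest); }

  // Polling side: same priority order as CheckSuspend (checkpoint first,
  // then suspend request, otherwise keep running).
  SketchAction NextAction() const {
    uint32_t f = flags.load();
    if (f & kSketchCheckpointRequest) return SketchAction::kRunCheckpoint;
    if (f & kSketchSuspendRequest) return SketchAction::kSelfSuspend;
    return SketchAction::kContinue;
  }
};
```

Because the flag is only cleared when the count drops to zero, nested suspend requests from different requesters keep the thread parked until the last one resumes it.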
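On entry to a JNI function, when the thread switches from Runnable to Native, TransitionFromRunnableToSuspended is called as well. On return, JniMethodEnd goes through GoToRunnable to switch back: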
extern void JniMethodEnd(uint32_t saved_local_ref_cookie, Thread* self) {
  GoToRunnable(self);
  PopLocalReferences(saved_local_ref_cookie, self);
}

static void GoToRunnable(Thread* self) NO_THREAD_SAFETY_ANALYSIS {
  mirror::ArtMethod* native_method = self->GetManagedStack()->GetTopQuickFrame()->AsMirrorPtr();
  bool is_fast = native_method->IsFastNative();
  if (!is_fast) {
    self->TransitionFromSuspendedToRunnable();
  } else if (UNLIKELY(self->TestAllFlags())) {
    // In fast JNI mode we never transitioned out of runnable. Perform a suspend check if there
    // is a flag raised.
    DCHECK(Locks::mutator_lock_->IsSharedHeld(self));
    CheckSuspend(self);
  }
}
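Next, let's look at how these transition functions are implemented, starting with TransitionFromRunnableToSuspended: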
inline void Thread::TransitionFromRunnableToSuspended(ThreadState new_state) {
  AssertThreadSuspensionIsAllowable();
  DCHECK_NE(new_state, kRunnable);
  DCHECK_EQ(this, Thread::Current());
  // Change to non-runnable state, thereby appearing suspended to the system.
  DCHECK_EQ(GetState(), kRunnable);
  union StateAndFlags old_state_and_flags;
  union StateAndFlags new_state_and_flags;
  while (true) {
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    if (UNLIKELY((old_state_and_flags.as_struct.flags & kCheckpointRequest) != 0)) {
      RunCheckpointFunction();
      continue;
    }
    // Change the state but keep the current flags (kCheckpointRequest is clear).
    DCHECK_EQ((old_state_and_flags.as_struct.flags & kCheckpointRequest), 0);
    new_state_and_flags.as_struct.flags = old_state_and_flags.as_struct.flags;
    new_state_and_flags.as_struct.state = new_state;
    // CAS the value without a memory ordering as that is given by the lock release below.
    bool done =
        tls32_.state_and_flags.as_atomic_int.CompareExchangeWeakRelaxed(old_state_and_flags.as_int,
                                                                        new_state_and_flags.as_int);
    if (LIKELY(done)) {
      break;
    }
  }
  // Release share on mutator_lock_.
  Locks::mutator_lock_->SharedUnlock(this);
}

As you can see, switching to a suspended state is fairly simple: update the state word, then release the share of mutator_lock_. Releasing the lock is very important. It is the core of the timeout detection mechanism, which is covered in detail in Part 2.
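The "change the state but keep the current flags" step is a classic compare-and-swap loop over one packed word. The sketch below illustrates that pattern with std::atomic; the 16/16 bit layout, SketchState and the helper names are assumptions made for the example, not ART's actual definitions, and it uses the default sequentially consistent ordering rather than the relaxed CAS of the original.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout for this sketch: low 16 bits = flags, high 16 bits = state.
constexpr uint32_t kFlagMask = 0xFFFFu;
constexpr uint32_t kSuspendRequestBit = 1u;

enum class SketchState : uint16_t { kRunnable = 1, kNative = 2, kSuspended = 3 };

inline uint32_t Pack(SketchState state, uint32_t flags) {
  return (static_cast<uint32_t>(state) << 16) | (flags & kFlagMask);
}
inline SketchState StateOf(uint32_t word) {
  return static_cast<SketchState>(word >> 16);
}
inline uint32_t FlagsOf(uint32_t word) { return word & kFlagMask; }

// Replace the state half of the word while preserving whatever flags are set
// at the moment the CAS succeeds.
inline void ChangeStateKeepFlags(std::atomic<uint32_t>& word, SketchState new_state) {
  uint32_t old_word = word.load();
  for (;;) {
    uint32_t new_word = Pack(new_state, FlagsOf(old_word));
    // On failure compare_exchange_weak stores the current value into old_word,
    // so the next iteration rebuilds new_word from fresh flags.
    if (word.compare_exchange_weak(old_word, new_word)) {
      return;
    }
  }
}
```

Keeping state and flags in one word is what lets a single CAS detect a flag that another thread raised between the read and the write.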
Switching from a suspended state back to Runnable:
inline ThreadState Thread::TransitionFromSuspendedToRunnable() {
  bool done = false;
  union StateAndFlags old_state_and_flags;
  old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
  int16_t old_state = old_state_and_flags.as_struct.state;
  DCHECK_NE(static_cast<ThreadState>(old_state), kRunnable);
  do {
    Locks::mutator_lock_->AssertNotHeld(this);  // Otherwise we starve GC..
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
    if (UNLIKELY((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0)) {
      // Wait while our suspend count is non-zero.
      MutexLock mu(this, *Locks::thread_suspend_count_lock_);
      old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
      DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
      while ((old_state_and_flags.as_struct.flags & kSuspendRequest) != 0) {
        // (Author's note) Until the thread is resumed, kSuspendRequest stays set, so it sits in
        // this condition wait, exactly as in the dump shown earlier.
        // Re-check when Thread::resume_cond_ is notified.
        Thread::resume_cond_->Wait(this);
        old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
        DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
      }
      DCHECK_EQ(GetSuspendCount(), 0);
    }
    // Re-acquire shared mutator_lock_ access.
    Locks::mutator_lock_->SharedLock(this);
    // Atomically change from suspended to runnable if no suspend request pending.
    old_state_and_flags.as_int = tls32_.state_and_flags.as_int;
    DCHECK_EQ(old_state_and_flags.as_struct.state, old_state);
    if (LIKELY((old_state_and_flags.as_struct.flags & kSuspendRequest) == 0)) {
      union StateAndFlags new_state_and_flags;
      new_state_and_flags.as_int = old_state_and_flags.as_int;
      new_state_and_flags.as_struct.state = kRunnable;
      // CAS the value without a memory ordering as that is given by the lock acquisition above.
      done =
          tls32_.state_and_flags.as_atomic_int.CompareExchangeWeakRelaxed(old_state_and_flags.as_int,
                                                                          new_state_and_flags.as_int);
    }
    if (UNLIKELY(!done)) {
      // Failed to transition to Runnable. Release shared mutator_lock_ access and try again.
      Locks::mutator_lock_->SharedUnlock(this);
    } else {
      return static_cast<ThreadState>(old_state);
    }
  } while (true);
}
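It first checks kSuspendRequest. While the flag is set, the thread waits on the condition variable until it is woken. Once woken, it takes its share of mutator_lock_ and changes its state; note that the state change is done with an atomic operation.
So the core of suspend is this: while the kSuspendRequest flag is set, the thread sits in a condition wait until someone wakes it, and that is how it ends up stopped.
That raises the next question: how does the requester know that all threads have actually suspended, so that it can go ahead with its own work?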
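The wait-until-resumed idea can be sketched with a standard mutex and condition variable. SuspendGate and DemoWorkerResumes are hypothetical names for this example; the sketch waits on a plain counter rather than a flag bit, and it leaves out the mutator lock and the atomic state word entirely.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// A tiny stand-in for "suspend count + resume condition variable".
class SuspendGate {
 public:
  void RequestSuspend() {
    std::lock_guard<std::mutex> lock(mu_);
    ++suspend_count_;
  }

  void Resume() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (suspend_count_ > 0) --suspend_count_;
    }
    resume_cond_.notify_all();  // wake every waiter so it can re-check the count
  }

  // Called by a thread that wants to become runnable again: blocks for as
  // long as a suspend request is outstanding.
  void WaitUntilResumed() {
    std::unique_lock<std::mutex> lock(mu_);
    resume_cond_.wait(lock, [this] { return suspend_count_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable resume_cond_;
  int suspend_count_ = 0;
};

// Demo helper: a worker parks at the gate and gets through only after Resume().
inline bool DemoWorkerResumes() {
  SuspendGate gate;
  gate.RequestSuspend();
  bool passed_gate = false;
  std::thread worker([&] {
    gate.WaitUntilResumed();
    passed_gate = true;
  });
  gate.Resume();
  worker.join();  // join orders the worker's write before our read
  return passed_gate;
}
```

The predicate form of wait re-checks the count after every wakeup, which plays the same role as the while loop around resume_cond_->Wait in the ART code.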
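Part 2: the suspend timeout detection mechanism
As mentioned above, entering and leaving the Runnable state always comes with an unlock or lock of mutator_lock_. This is the lock a thread must hold while it is running, and many functions are annotated as requiring its protection. So the state of mutator_lock_ tells us whether anyone is still Runnable. ART does it like this: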
  // Block on the mutator lock until all Runnable threads release their share of access.
#if HAVE_TIMED_RWLOCK
  // Timeout if we wait more than 30 seconds.
  if (!Locks::mutator_lock_->ExclusiveLockWithTimeout(self, 30 * 1000, 0)) {
    UnsafeLogFatalForThreadSuspendAllTimeout();
  }
#else
  Locks::mutator_lock_->ExclusiveLock(self);
#endif
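It tries to take mutator_lock_ exclusively with a 30 s timeout. Underneath this is just a futex wait, so I will not go into it here.
In practice, then, finding the threads whose state is Runnable in the dump log points you at the cause of the timeout.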
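The same idea can be reproduced with a standard reader-writer lock: runnable threads hold a shared lock, and the suspender asks for exclusive access with a deadline. This is a sketch under stated assumptions: std::shared_timed_mutex stands in for mutator_lock_, TrySuspendAll and the demo helper are hypothetical names, and a 50 ms timeout replaces the 30 s one.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <shared_mutex>
#include <thread>

// Stand-in for mutator_lock_ in this sketch.
using SketchMutatorLock = std::shared_timed_mutex;

// Returns true if exclusive access was obtained before the timeout (and
// releases it again). The real runtime logs a fatal error on timeout.
inline bool TrySuspendAll(SketchMutatorLock& lock, std::chrono::milliseconds timeout) {
  if (!lock.try_lock_for(timeout)) {
    return false;  // some thread still holds its shared ("runnable") access
  }
  lock.unlock();
  return true;
}

// Demo helper: times out while a "runnable" thread holds its share, then
// succeeds once that thread has let go.
inline bool DemoTimeoutThenSuccess() {
  SketchMutatorLock lock;
  std::promise<void> holding;      // worker -> main: shared lock is held
  std::promise<void> may_release;  // main -> worker: you may let go now
  std::future<void> holding_signal = holding.get_future();
  std::future<void> release_signal = may_release.get_future();

  std::thread runnable([&] {
    std::shared_lock<SketchMutatorLock> share(lock);
    holding.set_value();
    release_signal.wait();
  });  // the share is released when the lambda returns

  holding_signal.wait();
  bool timed_out = !TrySuspendAll(lock, std::chrono::milliseconds(50));
  may_release.set_value();
  runnable.join();
  bool succeeded = TrySuspendAll(lock, std::chrono::milliseconds(50));
  return timed_out && succeeded;
}
```

The suspender never inspects individual threads here; a single timed lock attempt answers "is anyone still runnable?", which is the point of tying the state transitions to the lock.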
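When a dump contains many threads, a small filter helps with that search. The sketch below parses a thread header line of the form seen in the dump at the top of these notes (quoted name, then key=value fields, then the state after tid=N). The function names are hypothetical, and the parser assumes that header layout; other Android versions may format it differently.

```cpp
#include <cassert>
#include <string>

struct DumpHeader {
  bool ok = false;     // false if the line does not look like a thread header
  std::string name;    // text between the leading pair of double quotes
  std::string state;   // token that follows the tid=N field
};

// Parses a line such as:  "android.fg" prio=5 tid=19 Native
inline DumpHeader ParseDumpHeader(const std::string& line) {
  DumpHeader out;
  if (line.empty() || line[0] != '"') return out;
  std::size_t close_quote = line.find('"', 1);
  if (close_quote == std::string::npos) return out;
  std::size_t tid = line.find(" tid=", close_quote);
  if (tid == std::string::npos) return out;
  // The state starts after the space that ends the tid=N field.
  std::size_t state_begin = line.find(' ', tid + 1);
  if (state_begin == std::string::npos) return out;
  ++state_begin;
  std::size_t state_end = line.find(' ', state_begin);
  out.name = line.substr(1, close_quote - 1);
  out.state = (state_end == std::string::npos)
                  ? line.substr(state_begin)
                  : line.substr(state_begin, state_end - state_begin);
  out.ok = !out.state.empty();
  return out;
}

// True for header lines whose thread state is Runnable.
inline bool IsRunnableHeader(const std::string& line) {
  DumpHeader header = ParseDumpHeader(line);
  return header.ok && header.state == "Runnable";
}
```

Feeding each line of a dump through IsRunnableHeader leaves only the threads that were still holding their share of the lock when the timeout fired.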
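While we are here, a word on why threads waiting in synchronized (Blocked) do not cause the timeout:
synchronized is implemented by Monitor. This is quite similar to the implementation in DVM; both are built on thin and fat locks.
When a thread enters synchronized and has to wait for the current lock holder to finish, it switches its own state from Runnable to Blocked. It does so by creating a ScopedThreadStateChange object, whose constructor calls the transition functions described above: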
class ScopedThreadStateChange {
 public:
  ScopedThreadStateChange(Thread* self, ThreadState new_thread_state)
      LOCKS_EXCLUDED(Locks::thread_suspend_count_lock_) ALWAYS_INLINE
      : self_(self), thread_state_(new_thread_state), expected_has_no_thread_(false) {
    if (UNLIKELY(self_ == NULL)) {
      // Value chosen arbitrarily and won't be used in the destructor since thread_ == NULL.
      old_thread_state_ = kTerminated;
      Runtime* runtime = Runtime::Current();
      CHECK(runtime == NULL || !runtime->IsStarted() || runtime->IsShuttingDown(self_));
    } else {
      DCHECK_EQ(self, Thread::Current());
      // Read state without locks, ok as state is effectively thread local and we're not interested
      // in the suspend count (this will be handled in the runnable transitions).
      old_thread_state_ = self->GetState();
      if (old_thread_state_ != new_thread_state) {
        if (new_thread_state == kRunnable) {
          self_->TransitionFromSuspendedToRunnable();
        } else if (old_thread_state_ == kRunnable) {
          self_->TransitionFromRunnableToSuspended(new_thread_state);
        } else {
          // A suspended transition to another effectively suspended transition, ok to use Unsafe.
          self_->SetState(new_thread_state);
        }
      }
    }
  }
  // (rest of the class not shown in this excerpt)
};
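So a thread that is blocked has also released its share of mutator_lock_.
As I see it for now, holding mutator_lock_ is effectively the marker of whether a thread is in an executable state.
To sum up: suspend works by setting kSuspendRequest. Threads check the flag at certain points during execution, switch to a suspended state when they see it, and release mutator_lock_ as part of that switch.
The suspend check is an attempt to take mutator_lock_ exclusively within 30 s. If it times out, some thread has still not finished what it was doing, so the runtime dumps the stacks and aborts.
That's it for now; these are notes to self. There are bound to be misunderstandings in here. I am especially unsure about my guess as to where CheckSuspend gets executed, and will revisit it when I hit the next problem.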
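The scoped pattern itself is easy to show in isolation. In the sketch below, MiniThread and MiniScopedStateChange are hypothetical stand-ins: a boolean replaces the shared hold on mutator_lock_, and the destructor that restores the old state is an assumption of the sketch, since the excerpt above only shows the constructor.

```cpp
#include <cassert>

enum class MiniState { kRunnable, kBlocked, kNative };

struct MiniThread {
  MiniState state = MiniState::kRunnable;
  bool holds_mutator_share = true;  // stand-in for a shared hold on mutator_lock_

  // In this sketch only a Runnable thread holds its share of the lock.
  void SetState(MiniState new_state) {
    state = new_state;
    holds_mutator_share = (new_state == MiniState::kRunnable);
  }
};

// RAII helper: switch state on construction, switch back on destruction.
class MiniScopedStateChange {
 public:
  MiniScopedStateChange(MiniThread* self, MiniState new_state)
      : self_(self), old_state_(self->state) {
    self_->SetState(new_state);
  }
  ~MiniScopedStateChange() { self_->SetState(old_state_); }

  MiniScopedStateChange(const MiniScopedStateChange&) = delete;
  MiniScopedStateChange& operator=(const MiniScopedStateChange&) = delete;

 private:
  MiniThread* const self_;
  const MiniState old_state_;
};

// Demo helper: the share is given up while Blocked and held again afterwards.
inline bool DemoBlockedReleasesShare() {
  MiniThread thread;
  bool released_while_blocked = false;
  {
    MiniScopedStateChange change(&thread, MiniState::kBlocked);
    released_while_blocked = !thread.holds_mutator_share;
  }
  return released_while_blocked && thread.holds_mutator_share &&
         thread.state == MiniState::kRunnable;
}
```

Tying the transition to object lifetime means every early return or exception path out of the blocking region still switches the thread back.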