Multithreaded Programming: Understanding pthread_cond_wait()

Source: Internet | Published: linux终端删除～$ | Editor: 程序博客网 | Time: 2024/09/21 09:00

Understanding pthread_cond_wait() (my answer to someone's question on CU)
(This is my personal understanding; if anything is wrong, please point it out.)
/************pthread_cond_wait()的使用方法**********/
pthread_mutex_lock(&qlock);
pthread_cond_wait(&qready, &qlock);
pthread_mutex_unlock(&qlock);
/*****************************************************/
The mutex passed to pthread_cond_wait protects the condition. The caller passes it locked to the function, which then atomically places the calling thread on the list of threads waiting for the condition and unlocks the mutex. This closes the window between the time that the condition is checked and the time that the thread goes to sleep waiting for the condition to change, so that the thread doesn't miss a change in the condition. When pthread_cond_wait returns, the mutex is again locked.
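The passage above is quoted from APUE. It says that the mutex argument of pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex) protects the condition. We call pthread_cond_wait when the condition does not hold, and the thread is then about to block. If the condition changed while the thread was on its way to blocking, before it had been placed on the wait queue, the change would be missed. That is why the mutex must be locked with pthread_mutex_lock() before pthread_cond_wait is called. Once pthread_cond_wait has placed the thread on the wait queue, it automatically unlocks the mutex so that other threads can acquire it, access the shared resource, and wake the blocked thread at the appropriate time. When pthread_cond_wait returns, it automatically locks the mutex again.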

The lock/unlock sequence of the code above is actually as follows:
/************pthread_cond_wait()的使用方法**********/

pthread_mutex_lock(&qlock); /*lock*/
pthread_cond_wait(&qready, &qlock); /*enqueue and unlock (atomically)-->block-->woken up-->lock-->return*/
pthread_mutex_unlock(&qlock); /*unlock*/
/*****************************************************/
Here is a very good article that analyzes in depth the concept and implementation of POSIX thread cancellation points, so I am reposting it.

The following is reposted from: http://blog.solrex.cn/articles/l ... llation-points.html

Abstract:
Starting from a small example of a multithreaded deadlock caused by pthread_cancel on Linux, this article explains how Linux implements POSIX thread cancellation points and how to avoid the resulting deadlock.

Contents:
1. A small example of a thread deadlock caused by pthread_cancel
2. Cancellation points
3. Cancellation types
4. The Linux implementation of cancellation points
5. Why the example deadlocks
6. How to avoid this deadlock
7. Conclusion
8. References
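1. A small example of a thread deadlock caused by pthread_cancel
Below is a small example that deadlocks on Linux. The program only uses a condition variable and a mutex for simple thread synchronization. thread0 starts first and locks mutex, then calls pthread_cond_wait, which puts thread tid[0] on the list of threads waiting for the condition and then unlocks mutex. thread1 starts and sleeps for 10 seconds, by which time pthread_cond_wait should already have unlocked mutex. Thread tid[1] then locks mutex, broadcasts to wake every thread waiting on cond, and unlocks mutex. Once mutex is unlocked, pthread_cond_wait in thread tid[0] locks mutex again and returns, and finally tid[0] unlocks mutex.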
[code:c]
#include <pthread.h>
#include <stdlib.h>   /* exit */
#include <unistd.h>   /* sleep */

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void* thread0(void* arg)
{
    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);   /* blocks here; mutex is released while waiting */
    pthread_mutex_unlock(&mutex);
    pthread_exit(NULL);
}

void* thread1(void* arg)
{
    sleep(10);
    pthread_mutex_lock(&mutex);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&mutex);
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t tid[2];
    if (pthread_create(&tid[0], NULL, &thread0, NULL) != 0) {
        exit(1);
    }
    if (pthread_create(&tid[1], NULL, &thread1, NULL) != 0) {
        exit(1);
    }
    sleep(5);
    pthread_cancel(tid[0]);   /* thread0 is still waiting at this point */

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&cond);
    return 0;
}[/code]
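This looks fine, except that main calls pthread_cancel to cancel thread tid[0]. When compiled and run, the program never terminates. It looks as if cancelling tid[0] skipped its pthread_mutex_unlock call, so that mutex stays locked forever and thread tid[1] waits on it endlessly. Is that really what happens?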
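2. Cancellation points
Note that pthread_cancel does not wait for the thread to terminate; it only makes a request. After the request is made, the target thread keeps running until it reaches a cancellation point. A cancellation point is a place where a thread checks whether it has been cancelled and, if so, acts on the request. The pthread_cancel manual page lists the following POSIX thread functions as cancellation points (POSIX itself defines more, including many blocking I/O calls):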

pthread_join(3)
pthread_cond_wait(3)
pthread_cond_timedwait(3)
pthread_testcancel(3)
sem_wait(3)
sigwait(3)

In this list we find that pthread_cond_wait is one of the cancellation points.
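What is puzzling, however, is that every article about cancellation points only says that a cancelled thread keeps running until it reaches a cancellation point and is cancelled there. In the example above, main sleeps for 5 seconds before calling pthread_cancel. So when pthread_cancel is called, has thread0 already reached pthread_cond_wait or not?
If thread0 has already reached pthread_cond_wait, then by that description it should keep running to the next cancellation point and be cancelled there. There is no later cancellation point, so thread0 should run on to pthread_exit and finish. By then mutex would have been unlocked, so no deadlock should occur.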
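3. Cancellation types
The usual phrasing, "such-and-such function is a cancellation point", is easy to misread, because executing a function takes a span of time rather than a single instant. In the implementation examined here, the real cancellation point is the interval inside these functions between the moment the cancellation type is switched to PTHREAD_CANCEL_ASYNCHRONOUS and the moment it is switched back to PTHREAD_CANCEL_DEFERRED.
POSIX defines two cancellation types. Deferred cancellation (PTHREAD_CANCEL_DEFERRED) is the system default: no cancellation actually happens until the thread reaches a cancellation point. With asynchronous cancellation (PTHREAD_CANCEL_ASYNCHRONOUS), the thread can be cancelled at any time.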
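4. The Linux implementation of cancellation points
Now let us look at how Linux implements cancellation points. (Strictly speaking this is the GNU implementation, because the pthread library lives in glibc.) The pthread library used on Linux today is NPTL, which replaced the older LinuxThreads and is included in glibc.
Taking pthread_cond_wait as the example, glibc-2.6/nptl/pthread_cond_wait.c contains: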
[code:c]
/* Enable asynchronous cancellation. Required by the standard. */
cbuffer.oldtype = __pthread_enable_asynccancel ();

/* Wait until woken by signal or broadcast. */
lll_futex_wait (&cond->__data.__futex, futex_val);

/* Disable asynchronous cancellation. */
__pthread_disable_asynccancel (cbuffer.oldtype);[/code]
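As we can see, before the thread starts waiting, pthread_cond_wait sets the cancellation type to asynchronous (__pthread_enable_asynccancel); when the thread is woken up, the type is switched back to deferred (__pthread_disable_asynccancel).
This means that a cancellation request received before __pthread_enable_asynccancel is held until __pthread_enable_asynccancel executes and is handled at that point, and a request received after that but before __pthread_disable_asynccancel is handled before __pthread_disable_asynccancel runs. The real cancellation point is therefore the interval between these two calls.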
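5. Why the example deadlocks
Before main calls pthread_cancel, thread0 has already entered pthread_cond_wait and put itself on the list of threads waiting for the condition (lll_futex_wait). This can be verified by setting breakpoints on the functions involved in GDB.
When pthread_cancel is called, thread tid[0] is still waiting. The cancellation request arrives before __pthread_disable_asynccancel, so it is acted on immediately. However, pthread_cond_wait has registered a thread cleanup handler (glibc-2.6/nptl/pthread_cond_wait.c):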
[code:c]
/* Before we block we enable cancellation. Therefore we have to
   install a cancellation handler. */
__pthread_cleanup_push (&buffer, __condvar_cleanup, &cbuffer);[/code]
So what does this cleanup handler, __condvar_cleanup, do? At the end of its implementation (glibc-2.6/nptl/pthread_cond_wait.c) we find:
[code:c]
/* Get the mutex before returning unless asynchronous cancellation
   is in effect. */
__pthread_mutex_cond_lock (cbuffer->mutex);
}[/code]
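So __condvar_cleanup locks mutex again as its last step, and thread0 then terminates while still holding it. At that moment thread1 is still asleep (sleep(10)). By the time it wakes up, mutex is locked and will never be released, which is why thread1 blocks forever.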
6. How to avoid this deadlock
Cleanup handlers registered with pthread_cleanup_push run in first-in, last-out (FILO) order, so we can register a cleanup handler of our own before calling pthread_cond_wait:
[code:c]
void cleanup(void *arg)
{
    pthread_mutex_unlock(&mutex);
}
void* thread0(void* arg)
{
    pthread_cleanup_push(cleanup, NULL); // thread cleanup handler
    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);
    pthread_cleanup_pop(0);
    pthread_exit(NULL);
}[/code]
This way, when the thread is cancelled, the handler registered inside pthread_cond_wait, __condvar_cleanup, runs first and locks mutex; then cleanup, the handler registered in thread0, runs and unlocks it. The deadlock is avoided.
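7. Conclusion
Thread synchronization in multithreaded programs has always been a headache. The cancellation point concept that POSIX introduced, to avoid the resource problems caused by cancelling a thread immediately, is a very good design, but careless use of pthread_cancel can still cause synchronization problems. Knowing how POSIX thread cancellation points are implemented on Linux helps in understanding the mechanism and in using it well.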