Repost: On spin_lock (embarrassingly, this is the first time I have paid attention to kernel execution paths)


On spin_lock

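To be clear, the choice of mutual-exclusion primitive depends not on the size of the critical section but on its nature, and on which pieces of code, that is, which kernel execution paths, contend for it.
Strictly speaking, semaphores and the spinlock_XXX family are mutual-exclusion mechanisms at different levels; the former is implemented on top of the latter. It is a bit like HTTP and TCP: both are protocols, but at different layers.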
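Take the semaphore first. It works at the process level and provides mutual exclusion over a resource among multiple processes. The code runs in the kernel, but that kernel execution path acts on behalf of a process and contends for the resource as that process. If it loses, a context switch follows: the process goes to sleep, while the CPU does not stop and goes on to run other execution paths. Conceptually this has nothing directly to do with whether there is one CPU or many; it is only that the semaphore's own implementation needs a spinlock on multi-CPU systems to keep accesses to the semaphore structure atomic.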
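Inside the kernel, the more common need is mutual exclusion on data access among the kernel's own execution paths. This is the most basic exclusion problem, namely keeping data modification atomic, and the semaphore implementation depends on it too. On a single CPU the contenders are mainly interrupts and bottom halves, so disabling and re-enabling interrupts is enough. On multiple CPUs, interference from other CPUs is added, so a spinlock is needed as well. Combining the two parts gives spinlock_XXX. Its defining trait is that once a CPU enters spinlock_XXX it does nothing else; it spins until it acquires the lock. It follows that a critical section guarded by spinlock_XXX must not stall, let alone context switch: access the data and get out quickly so that the other spinning paths can take the lock. That is the basic rule of spinlocks. If the current path really has to context switch, it must release the spinlock before calling schedule(); otherwise deadlock is likely. Interrupts and bottom halves have no process context and cannot context switch, so they can only spin waiting for the lock, and once you have switched away there is no telling when you will be back.
Because the whole purpose of a spinlock is to keep data modification atomic, there is no reason to linger in a critical section it protects.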
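The spin-until-acquired behavior described above can be sketched in user space with C11 atomics. This is only an illustration: toy_spinlock_t and the toy_* functions are hypothetical names, not the kernel's spinlock_t, and the sketch ignores interrupts and bottom halves entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space spinlock; an assumption for illustration, not kernel code. */
typedef struct {
    atomic_flag locked;
} toy_spinlock_t;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once: true if we took the lock, false if someone already holds it. */
static bool toy_spin_trylock(toy_spinlock_t *l)
{
    /* test_and_set returns the previous value; false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait until the lock is ours. The waiter never sleeps. */
static void toy_spin_lock(toy_spinlock_t *l)
{
    while (!toy_spin_trylock(l))
        ;   /* spin */
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Keep the critical section tiny: touch the data, then leave at once. */
static int toy_counter_inc(toy_spinlock_t *l, int *counter)
{
    int v;

    assert(l != NULL && counter != NULL);
    toy_spin_lock(l);
    v = ++*counter;
    toy_spin_unlock(l);
    return v;
}
```

Because a waiter in toy_spin_lock() burns CPU for as long as the holder stays inside, the holder has every reason to keep the protected region as short as toy_counter_inc() does.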
spinlock_XXX comes in many forms:
spin_lock()/spin_unlock()
spin_lock_irq()/spin_unlock_irq()
spin_lock_irqsave()/spin_unlock_irqrestore()
spin_lock_bh()/spin_unlock_bh()
local_irq_disable()/local_irq_enable()
local_bh_disable()/local_bh_enable()
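So which one should be used when? That depends on which kernel execution path you are in and which kernel execution paths you need to exclude. The main execution paths in the kernel are:
1 The kernel mode of a user process. There is a process context here; the path is
    mostly executing system calls on behalf of the process.
2 Interrupts, exceptions, traps and the like. Conceptually there is no process
    context here and no context switch is possible.
3 bottom_half. Conceptually there is no process context here either.
4 In addition, the same execution path may be running on another CPU at the same time.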
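With these four factors in mind, decide which of them can access the data to be protected, and the form of spinlock follows. To exclude only other CPUs, use spin_lock/spin_unlock. To exclude irqs and other CPUs, use spin_lock_irq/spin_unlock_irq. To exclude irqs and other CPUs and also preserve the EFLAGS state, use spin_lock_irqsave/spin_unlock_irqrestore. To exclude bottom halves and other CPUs, use spin_lock_bh/spin_unlock_bh. If other CPUs need not be excluded and only irqs matter, use local_irq_disable/local_irq_enable; if other CPUs need not be excluded and only bottom halves matter, use local_bh_disable/local_bh_enable; and so on. Note that for one and the same piece of data, different kernel execution paths may use different forms (the irq_desc example shows this).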
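The selection rule just described is mechanical enough to write down as a lookup. The sketch below is a hypothetical helper, not a kernel API: the VS_* flags, SAVE_FLAGS, enum primitive and pick_primitive() are all names invented here to encode the rule. It assumes that excluding irqs also covers bottom halves on the local CPU, since bottom halves run on the way out of an interrupt.

```c
#include <assert.h>

/* Hypothetical flags: who else may touch the data. */
enum {
    VS_OTHER_CPU = 1 << 0,  /* any path running on another CPU */
    VS_IRQ       = 1 << 1,  /* interrupt handlers on this CPU */
    VS_BH        = 1 << 2,  /* bottom halves on this CPU */
    SAVE_FLAGS   = 1 << 3   /* the prior interrupt state must be preserved */
};

enum primitive {
    P_NONE, P_SPIN_LOCK, P_SPIN_LOCK_IRQ, P_SPIN_LOCK_IRQSAVE,
    P_SPIN_LOCK_BH, P_LOCAL_IRQ_DISABLE, P_LOCAL_BH_DISABLE, P_COUNT
};

static const char *const primitive_names[P_COUNT] = {
    "none", "spin_lock", "spin_lock_irq", "spin_lock_irqsave",
    "spin_lock_bh", "local_irq_disable", "local_bh_disable"
};

static const char *primitive_name(enum primitive p)
{
    assert((unsigned)p < P_COUNT);
    return primitive_names[p];
}

/* Map the set of contenders to the lock-side primitive named in the text. */
static enum primitive pick_primitive(unsigned who)
{
    if (who & VS_OTHER_CPU) {
        if (who & VS_IRQ)
            return (who & SAVE_FLAGS) ? P_SPIN_LOCK_IRQSAVE : P_SPIN_LOCK_IRQ;
        if (who & VS_BH)
            return P_SPIN_LOCK_BH;
        return P_SPIN_LOCK;
    }
    /* Local CPU only. The text lists no "save" variant for this case,
     * so SAVE_FLAGS is ignored here. */
    if (who & VS_IRQ)
        return P_LOCAL_IRQ_DISABLE;
    if (who & VS_BH)
        return P_LOCAL_BH_DISABLE;
    return P_NONE;
}
```

For instance, pick_primitive(VS_OTHER_CPU | VS_BH) yields P_SPIN_LOCK_BH, and primitive_name() turns that into the string "spin_lock_bh".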
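Here is an example. The interrupt code has an array irq_desc[] of type irq_desc_t; each element describes one irq and holds, among other things, the handler for that irq. The irq_desc_t structure contains a spinlock that serializes access to (modification of) the descriptor. A given element irq_desc[irq] is accessed from two kernel execution paths: first, when the handler for that irq is installed (setup_irq), which usually happens during module or system initialization; second, in the interrupt entry function (do_IRQ). The code is as follows: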
int setup_irq(unsigned int irq, struct irqaction * new)
{
        int shared = 0;
        unsigned long flags;
        struct irqaction *old, **p;
        irq_desc_t *desc = irq_desc + irq;
        /*
         * Some drivers like serial.c use request_irq() heavily,
         * so we have to be careful not to interfere with a
         * running system.
         */
        if (new->flags & SA_SAMPLE_RANDOM) {
                /*
                 * This function might sleep, we want to call it first,
                 * outside of the atomic block.
                 * Yes, this might clear the entropy pool if the wrong
                 * driver is attempted to be loaded, without actually
                 * installing a new handler, but is this really a problem,
                 * only the sysadmin is able to do this.
                 */
                rand_initialize_irq(irq);
        }
        /*
         * The following block of code has to be executed atomically
         */
[1]     spin_lock_irqsave(&desc->lock,flags);
        p = &desc->action;
        if ((old = *p) != NULL) {
                /* Can't share interrupts unless both agree to */
                if (!(old->flags & new->flags & SA_SHIRQ)) {
[2]                     spin_unlock_irqrestore(&desc->lock,flags);
                        return -EBUSY;
                }
                /* add new interrupt at end of irq queue */
                do {
                        p = &old->next;
                        old = *p;
                } while (old);
                shared = 1;
        }
        *p = new;
        if (!shared) {
                desc->depth = 0;
                desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT |IRQ_WAITING);
                desc->handler->startup(irq);
        }
[3]     spin_unlock_irqrestore(&desc->lock,flags);
        register_irq_proc(irq);
        return 0;
}
asmlinkage unsigned int do_IRQ(struct pt_regs regs)
{
        /*
         * We ack quickly, we don't want the irq controller
         * thinking we're snobs just because some other CPU has
         * disabled global interrupts (we have already done the
         * INT_ACK cycles, it's too late to try to pretend to the
         * controller that we aren't taking the interrupt).
         *
         * 0 return value means that this irq is already being
         * handled by some other CPU. (or is disabled)
         */
        int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
        int cpu = smp_processor_id();
        irq_desc_t *desc = irq_desc + irq;
        struct irqaction * action;
        unsigned int status;
        kstat.irqs[cpu][irq]++;
[4]     spin_lock(&desc->lock);
        desc->handler->ack(irq);
        /*
           REPLAY is when Linux resends an IRQ that was dropped earlier
           WAITING is used by probe to mark irqs that are being tested
           */
        status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
        status |= IRQ_PENDING; /* we _want_ to handle it */
        /*
         * If the IRQ is disabled for whatever reason, we cannot
         * use the action we have.
         */
        action = NULL;
        if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
                action = desc->action;
                status &= ~IRQ_PENDING; /* we commit to handling */
                status |= IRQ_INPROGRESS; /* we are handling it */
        }
        desc->status = status;
        /*
         * If there is no IRQ handler or it was disabled, exit early.
           Since we set PENDING, if another processor is handling
           a different instance of this same irq, the other processor
           will take care of it.
         */
        if (!action)
                goto out;
        /*
         * Edge triggered interrupts need to remember
         * pending events.
         * This applies to any hw interrupts that allow a second
         * instance of the same irq to arrive while we are in do_IRQ
         * or in the handler. But the code here only handles the _second_
         * instance of the irq, not the third or fourth. So it is mostly
         * useful for irq hardware that does not mask cleanly in an
         * SMP environment.
         */
        for (;;) {
[5]             spin_unlock(&desc->lock);
                handle_IRQ_event(irq, &regs, action);
[6]             spin_lock(&desc->lock);
                if (!(desc->status & IRQ_PENDING))
                        break;
                desc->status &= ~IRQ_PENDING;
        }
        desc->status &= ~IRQ_INPROGRESS;
out:
        /*
         * The ->end() handler has to deal with interrupts which got
         * disabled while the handler was running.
         */
        desc->handler->end(irq);
[7]     spin_unlock(&desc->lock);
        if (softirq_pending(cpu))
                do_softirq();
        return 1;
}
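In setup_irq(), another CPU may be running setup_irq() at the same time, or a local irq may arrive while setup_irq() is running, so that do_IRQ() executes and modifies desc->status. To guard against both other CPUs and local irqs, spin_lock_irqsave/spin_unlock_irqrestore is used, as shown at [1][2][3].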
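Why the save/restore pair rather than the unconditional irq pair? One plausible reason, stated here as an assumption rather than taken from the kernel source, is that a function such as setup_irq() cannot know whether its caller already had interrupts disabled. The user-space model below uses a single boolean as a stand-in for the CPU's interrupt-enable bit; every sim_* name and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable bit of the local CPU (IF in EFLAGS on x86). */
static bool irq_enabled = true;

static void sim_irq_disable(void) { irq_enabled = false; }
static void sim_irq_enable(void)  { irq_enabled = true; }

/* Remember the current state, then disable: the "irqsave" half. */
static unsigned long sim_irq_save(void)
{
    unsigned long flags = irq_enabled;

    irq_enabled = false;
    return flags;
}

/* Put back whatever state was saved: the "irqrestore" half. */
static void sim_irq_restore(unsigned long flags)
{
    irq_enabled = (flags != 0);
}

/* Helper written with the unconditional pair. */
static void helper_plain(void)
{
    sim_irq_disable();
    /* ... critical section ... */
    sim_irq_enable();           /* always turns interrupts back on */
}

/* Helper written with the save/restore pair. */
static void helper_save(void)
{
    unsigned long flags = sim_irq_save();

    assert(!irq_enabled);       /* interrupts are off inside the section */
    /* ... critical section ... */
    sim_irq_restore(flags);     /* back to what the caller had */
}

/* A caller that already runs with interrupts off calls the helper.
 * Returns the interrupt state the caller sees right after the call. */
static bool state_after(void (*helper)(void))
{
    bool seen;

    sim_irq_disable();
    helper();
    seen = irq_enabled;
    sim_irq_enable();           /* reset the model for the next experiment */
    return seen;
}
```

In this model state_after(helper_plain) reports interrupts enabled, which breaks the caller's own critical section, while state_after(helper_save) reports them still disabled.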
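In do_IRQ(), by contrast, the code is itself running in an interrupt and interrupts have not yet been re-enabled, so nothing on the local CPU can interrupt it. Other CPUs may be running setup_irq() or may also be in an interrupt, but from the point of view of the local do_IRQ() the two cases are the same: both are interference from another CPU. Plain spin_lock/spin_unlock is therefore enough, as shown at [4][5][6][7]. Note [5] in particular: the spinlock is released before the actual handler is called.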
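The unlock at [5] and the IRQ_PENDING loop around it can be modeled without any real concurrency. In the sketch below, which is a simplified assumption-laden model and not the kernel code, a second instance of the same irq "arrives on another CPU" while the handler runs. All names (struct sim_desc, the sim_* functions, ST_*) are hypothetical, and the simulated lock simply asserts if it is taken twice, which stands in for a deadlock.

```c
#include <assert.h>

enum { ST_INPROGRESS = 1, ST_PENDING = 2 };

struct sim_desc {
    unsigned status;
    int locked;         /* 1 while the simulated desc->lock is held */
    int handler_runs;   /* how many times the handler has been called */
    int arrivals_left;  /* extra instances "another CPU" will deliver */
};

static void sim_lock(struct sim_desc *d)   { assert(!d->locked); d->locked = 1; }
static void sim_unlock(struct sim_desc *d) { assert(d->locked);  d->locked = 0; }

/* What another CPU does when the same irq fires during handling:
 * it sees INPROGRESS, so it only leaves a PENDING note and returns. */
static void other_cpu_irq(struct sim_desc *d)
{
    sim_lock(d);        /* possible only because the lock was dropped */
    d->status |= ST_PENDING;
    sim_unlock(d);
}

/* The device handler; it runs WITHOUT the descriptor lock held. */
static void sim_handler(struct sim_desc *d)
{
    d->handler_runs++;
    if (d->arrivals_left > 0) {
        d->arrivals_left--;
        other_cpu_irq(d);
    }
}

/* Mirrors the for (;;) loop of do_IRQ(); returns the handler run count. */
static int sim_do_irq(struct sim_desc *d)
{
    sim_lock(d);
    d->status |= ST_INPROGRESS;
    for (;;) {
        sim_unlock(d);                  /* [5] drop the lock first */
        sim_handler(d);
        sim_lock(d);                    /* [6] retake it to inspect status */
        if (!(d->status & ST_PENDING))
            break;
        d->status &= ~ST_PENDING;       /* replay the instance we missed */
    }
    d->status &= ~ST_INPROGRESS;
    sim_unlock(d);                      /* [7] */
    return d->handler_runs;
}
```

With one extra arrival the handler runs twice. If the handler were called with the lock still held, other_cpu_irq() would trip the assertion in sim_lock(), which is the model's way of showing the other CPU spinning for the whole duration of the handler.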
Another example:
static void tasklet_hi_action(struct softirq_action *a)
{
        int cpu = smp_processor_id();
        struct tasklet_struct *list;
[8]     local_irq_disable();
        list = tasklet_hi_vec[cpu].list;
        tasklet_hi_vec[cpu].list = NULL;
[9]     local_irq_enable();
        while (list) {
                struct tasklet_struct *t = list;
                list = list->next;
                if (tasklet_trylock(t)) {
                        if (!atomic_read(&t->count)) {
                                if (!test_and_clear_bit(TASKLET_STATE_SCHED,&t->state))
                                        BUG();
                                t->func(t->data);
                                tasklet_unlock(t);
                                continue;
                        }
                        tasklet_unlock(t);
                }
[10]            local_irq_disable();
                t->next = tasklet_hi_vec[cpu].list;
                tasklet_hi_vec[cpu].list = t;
                __cpu_raise_softirq(cpu, HI_SOFTIRQ);
[11]            local_irq_enable();
        }
}
Here there is no contention between CPUs over tasklet_hi_vec[cpu], because each CPU has its own independent data. Only irqs need to be kept out, so local_irq_disable/local_irq_enable suffices, as shown at [8][9][10][11].
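The pattern at [8][9] is worth isolating: detach the whole per-CPU list inside a very short interrupts-off window, then walk the private copy with interrupts back on. Below is a minimal single-threaded model of that idea. struct item, pending_list, queue_item() and drain_items() are hypothetical names, and the boolean irq_on merely stands in for the local interrupt flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
    struct item *next;
    int value;
};

static bool irq_on = true;          /* simulated local interrupt flag */
static struct item *pending_list;   /* stands in for one CPU's list head */

static void sim_local_irq_disable(void) { irq_on = false; }
static void sim_local_irq_enable(void)  { irq_on = true; }

/* Producer: push one item onto this CPU's list.
 * Assumes it is called with interrupts enabled. */
static void queue_item(struct item *it)
{
    sim_local_irq_disable();
    it->next = pending_list;
    pending_list = it;
    sim_local_irq_enable();
}

/* Consumer: steal the whole list in a short critical section, then
 * process the private copy. Returns the sum of the processed values. */
static int drain_items(void)
{
    struct item *list;
    int sum = 0;

    sim_local_irq_disable();        /* like [8] */
    list = pending_list;
    pending_list = NULL;
    sim_local_irq_enable();         /* like [9] */

    while (list) {
        struct item *it = list;

        list = list->next;
        assert(irq_on);             /* the slow part runs with irqs enabled */
        sum += it->value;
    }
    return sum;
}
```

The interrupts-off window covers two pointer assignments regardless of how long the list is, so new items can be queued again almost immediately.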
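Q:
Hello.
In the article you wrote: "If other CPUs need not be excluded and only bottom halves matter, use local_bh_disable/local_bh_enable."
If a system call and a BH must exclude each other, and the system call uses local_bh_disable/local_bh_enable, what should the BH use?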
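A:
If you are sure the data is modified only by system calls and BHs, use spin_lock_bh/spin_unlock_bh in the system call and spin_lock/spin_unlock in the BH. The reasons are as follows:
1. In the system call, other CPUs may be executing system calls or BHs at the same time, so the spin_lock_ prefix is needed. On the local CPU an interrupt can arrive at any moment, and BHs run when the interrupt finishes, so the _bh suffix is needed. Together that gives spin_lock_bh/spin_unlock_bh.
2. In the BH, system calls and BHs on other CPUs must likewise be kept out, so the spin_lock_ prefix is mandatory. On the local CPU only an interrupt can preempt a running BH, and you are sure the interrupt handlers do not contend for the data, so they can be ignored. And since a BH is never re-entered on the same CPU, no suffix is needed. Together that gives spin_lock/spin_unlock.
To sum up, it comes down to two questions. First, who are you, that is, which kernel execution path are you in? Second, whom are you guarding against, that is, which kernel execution paths must be kept out? Match the answers to the variants and you are done.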
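Point 1 of the answer can be made concrete with a single-CPU model: a system call takes the lock, an interrupt arrives, and the BH runs on the way out of the interrupt. The sketch below is a simplified model under stated assumptions, not kernel code. All names are hypothetical, and instead of spinning forever the BH merely counts the cases in which it would have deadlocked.

```c
#include <assert.h>
#include <stdbool.h>

static bool lock_held;       /* simulated spinlock state */
static int  bh_disabled;     /* nesting count, in the spirit of local_bh_disable() */
static bool bh_pending;      /* a BH was raised while BHs were disabled */
static int  shared_data;
static int  would_deadlock;  /* times the BH found the lock held by its own CPU */

/* The BH body: on a real CPU a failed acquire here would spin forever,
 * because the holder is the very code this BH interrupted. */
static void bh_body(void)
{
    if (lock_held) {
        would_deadlock++;
        return;
    }
    lock_held = true;        /* spin_lock() */
    shared_data += 10;
    lock_held = false;       /* spin_unlock() */
}

/* An interrupt finishes on this CPU and wants to run the BH. */
static void irq_exit_raises_bh(void)
{
    if (bh_disabled)
        bh_pending = true;   /* deferred until BHs are enabled again */
    else
        bh_body();
}

static void sim_spin_lock_bh(void)
{
    bh_disabled++;
    lock_held = true;
}

static void sim_spin_unlock_bh(void)
{
    assert(bh_disabled > 0);
    lock_held = false;
    if (--bh_disabled == 0 && bh_pending) {
        bh_pending = false;
        bh_body();           /* the deferred BH runs now, lock is free */
    }
}

/* System call using plain spin_lock; an interrupt lands mid-section. */
static void syscall_plain(void)
{
    lock_held = true;
    irq_exit_raises_bh();
    shared_data += 1;
    lock_held = false;
}

/* Same system call using the _bh variant. */
static void syscall_bh_safe(void)
{
    sim_spin_lock_bh();
    irq_exit_raises_bh();
    shared_data += 1;
    sim_spin_unlock_bh();
}
```

In this model syscall_plain() leaves would_deadlock at 1, whereas syscall_bh_safe() defers the BH until after the unlock, so both updates to shared_data go through. The BH side needs only the plain lock operations, matching point 2.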
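zhrank says:
>>To be clear, the choice of mutual-exclusion primitive depends not on the size of the critical section but on its nature, and on
>>which pieces of code, that is, which kernel execution paths, contend for it.
I don't think that is entirely correct. The choice of mutual-exclusion primitive should be determined jointly by three factors: the size of the critical section, its nature, and the number of execution paths contending for it.
First, if an interrupt handler is among the contenders, the only option is a spinlock with local interrupts disabled.
Second, a semaphore causes a process context switch, so if the critical section is only a hundred or two instructions, that is, short-term exclusion, and few execution paths contend for it, a spinlock will actually perform better than a semaphore. A context switch is itself expensive, and afterwards the CPU cache suffers a large number of cache misses, which lowers system throughput.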