A Brief Walkthrough of Deferred Work in the Linux Kernel (Softirqs, Tasklets, Work Queues)

Source: Internet. Published by: 清华经济管理学院知乎. Editor: 程序博客网. Time: 2024/06/17 11:26

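As we saw in A Brief Walkthrough of the Linux Kernel Interrupt Subsystem, after the Linux kernel finishes handling an interrupt it may check whether any softirq is pending; if one is, and the kernel is not in interrupt context, it runs the softirqs. This is the part we left unfinished in Chapter 5 when we traced the whole interrupt-handling path. This chapter picks up from there and works through softirqs from start to finish. First, let us review what asm_do_IRQ has done so far: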

1. It calls irq_enter to prepare for handling the interrupt. The core of this is irq_enter => __irq_enter => add_preempt_count(HARDIRQ_OFFSET);, which simply records that the Linux kernel is now handling an interrupt.

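2. asm_do_IRQ => desc_handle_irq => desc->handle_irq(irq, desc); does the actual interrupt handling. Here handle_irq is either handle_level_irq or handle_edge_irq, but both end up calling handle_IRQ_event to handle the interrupt. After handling the interrupt and before returning, handle_IRQ_event calls local_irq_disable(); to turn off the ARM global interrupt.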

3. It calls irq_exit to clean up after the interrupt. irq_exit runs sub_preempt_count(IRQ_EXIT_OFFSET); (note: IRQ_EXIT_OFFSET equals HARDIRQ_OFFSET), which reverses the earlier add. The part we need to focus on is:

if (!in_interrupt() && local_softirq_pending())

invoke_softirq();

That is, if we are not inside interrupt handling or softirq handling, and a softirq is pending, we start processing softirqs.

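Softirqs are usually called the bottom half of interrupt handling. The interrupt handling that comes before them, which we analyzed in detail in A Brief Walkthrough of the Linux Kernel Interrupt Subsystem, is called the top half.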

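Our first task in this chapter is to start from invoke_softirq and work out how softirqs and tasklets operate, and to discuss how to use softirqs when designing a specific device driver and what to watch out for. Before that, though, we need to explain carefully what add_preempt_count, sub_preempt_count and in_interrupt, which we saw above, actually mean; otherwise the asm_do_IRQ code cannot be understood.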

1.1 Introducing struct thread_info

The Linux kernel uses struct task_struct to manage the information and state of each process, while some of the low-level process information is kept in struct thread_info. The task pointer in struct thread_info leads to the process's struct task_struct, and the stack member of struct task_struct (or the stack pointer register) leads back to the struct thread_info.

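When a process runs in kernel mode, its stack is 8K bytes. The stack starts at the high-address end of this 8K block and grows downward, and the process's struct thread_info is stored at the low-address end of the same block. So the kernel-mode stack of each process can be at most 8K bytes - sizeof(struct thread_info), as shown below: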

union thread_union {

struct thread_info thread_info;

unsigned long stack[THREAD_SIZE/sizeof(long)];

};

In ARM Linux (include/asm-arm/thread_info.h): #define THREAD_SIZE 8192

1.1.1 The current process: the current macro

We use this all the time when writing device drivers, but have you ever traced where current actually comes from?

include/asm-arm/current.h:

static inline struct task_struct *get_current(void) __attribute_const__;

static inline struct task_struct *get_current(void)

{

return current_thread_info()->task;

}

#define current (get_current())

1.1.2 current_thread_info

include/asm-arm/thread_info.h:

static inline struct thread_info *current_thread_info(void)

{

register unsigned long sp asm ("sp"); /* current value of the ARM stack pointer */

return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));

}
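To make the masking in current_thread_info concrete, here is a minimal user-space sketch (not kernel code). It assumes the 8192-byte THREAD_SIZE quoted above; the helper names thread_info_base and stack_bytes_left are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u  /* same value as the ARM definition quoted above */

/* Hypothetical helper: start of the THREAD_SIZE-aligned block that contains sp.
 * This is the same mask that current_thread_info() applies to the stack pointer. */
static uintptr_t thread_info_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}

/* Hypothetical helper: distance from sp down to the start of the block.
 * The usable stack is smaller still, by sizeof(struct thread_info). */
static uintptr_t stack_bytes_left(uintptr_t sp)
{
    return sp - thread_info_base(sp);
}
```

For example, a stack pointer of 0x12345678 lies in the block that starts at 0x12344000, so that is where the thread_info of the running process would be found; no lookup table is needed, only the alignment of the stack.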

1.1.3 preempt_count, add_preempt_count and sub_preempt_count

include/linux/preempt.h:

#define add_preempt_count(val) do { preempt_count() += (val); } while (0)

#define sub_preempt_count(val) do { preempt_count() -= (val); } while (0)

#define preempt_count() (current_thread_info()->preempt_count)

preempt_count is a member of struct thread_info. It is a 32-bit integer divided into three fields:

bits:    31-28                27-16           15-8        7-0

field:   preemptive (bit 28)  hardware irq    soft irq    preemptive

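The 32-bit preempt_count can be described as follows:

1. Bits 0-7 and bit 28 are both related to preemption. We focus on the soft irq field and the hard irq field: each time the Linux kernel enters a soft irq or a hardware irq it adds one to the corresponding field, and it subtracts one on exit.

2. On a preemptive kernel, preemption is not allowed as long as this 32-bit preempt_count is nonzero.

3. This lets us tell at any time whether we are in a hardware interrupt or a soft interrupt, which is very helpful when writing code that may interrupt other interrupt handlers. include/linux/hardirq.h defines: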

#define hardirq_count() (preempt_count() & HARDIRQ_MASK) /* nonzero while in a hardware irq */

#define softirq_count() (preempt_count() & SOFTIRQ_MASK) /* nonzero while in a soft irq */

#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK)) /* nonzero while in a hardware irq or a soft irq */

An example: irq_enter => __irq_enter => add_preempt_count(HARDIRQ_OFFSET).

1.1.4 in_interrupt

Tells whether we are inside interrupt handling or softirq handling; see include/linux/hardirq.h:

#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))

#define in_interrupt() (irq_count())
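The bit arithmetic behind these macros can be tried out in user space. The sketch below is a model under stated assumptions, not kernel code: the field widths (8, 8 and 12 bits) are taken from the layout described in 1.1.3, a single global variable stands in for current_thread_info()->preempt_count, and every model_ name is hypothetical.

```c
#include <assert.h>

/* Field widths assumed from the layout described in 1.1.3. */
#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS   12

#define SOFTIRQ_SHIFT  (PREEMPT_BITS)
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)

#define SOFTIRQ_OFFSET (1u << SOFTIRQ_SHIFT)   /* "one" in the soft irq field */
#define HARDIRQ_OFFSET (1u << HARDIRQ_SHIFT)   /* "one" in the hardware irq field */

#define SOFTIRQ_MASK   (((1u << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK   (((1u << HARDIRQ_BITS) - 1) << HARDIRQ_SHIFT)

/* Stands in for current_thread_info()->preempt_count. */
static unsigned int model_preempt_count;

/* What irq_enter / irq_exit do to the counter, reduced to the arithmetic. */
static void model_irq_enter(void) { model_preempt_count += HARDIRQ_OFFSET; }
static void model_irq_exit(void)  { model_preempt_count -= HARDIRQ_OFFSET; }

static unsigned int model_hardirq_count(void)
{
    return model_preempt_count & HARDIRQ_MASK;
}

static unsigned int model_softirq_count(void)
{
    return model_preempt_count & SOFTIRQ_MASK;
}

/* Nonzero while either the hardware irq field or the soft irq field is nonzero. */
static unsigned int model_in_interrupt(void)
{
    return model_preempt_count & (HARDIRQ_MASK | SOFTIRQ_MASK);
}
```

Because each field is a counter rather than a flag, two nested calls to model_irq_enter leave the hardware irq field at 2, and model_in_interrupt only returns to zero after both matching exits.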

1.1.5 local_bh_enable and local_bh_disable

These are somewhat similar to local_irq_enable and local_irq_disable, which we saw earlier. Those two can be thought of as enabling or disabling the top half of interrupt handling, except that they go further and control the ARM global interrupt directly. _local_bh_enable (local_bh_enable) and local_bh_disable enable or disable the bottom half of interrupt handling.

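1. local_bh_disable => __local_bh_disable => add_preempt_count(SOFTIRQ_OFFSET);. As we saw at the start of this chapter, if a hardware irq arrives at this point, then when the kernel finishes handling it and checks whether soft irqs may run, in_interrupt returns nonzero, so no soft irq is executed. This is how the bottom half gets disabled.

2. _local_bh_enable is the reverse of local_bh_disable. The Linux kernel also provides a similar variant, local_bh_enable, which does the work of _local_bh_enable and additionally checks whether soft irqs can be run right now; this performs somewhat better because pending soft irqs are not left waiting.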

So we normally use local_bh_enable together with local_bh_disable rather than _local_bh_enable with local_bh_disable, even though the latter two are the true "pair".
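The interaction between local_bh_disable, a hardware irq and local_bh_enable can be sketched as a small user-space model. This is an illustration under assumptions, not the kernel implementation: the offsets follow the 8/8/12-bit layout from 1.1.3, the softirq pass is reduced to clearing the pending bits, and all names here are hypothetical stand-ins.

```c
#include <assert.h>

#define SOFTIRQ_OFFSET 0x00000100u
#define HARDIRQ_OFFSET 0x00010000u
#define IRQ_MASK       0x0fffff00u    /* hardware irq and soft irq fields together */

static unsigned int preempt_cnt;      /* stands in for preempt_count() */
static unsigned int pending;          /* stands in for __softirq_pending */
static int softirq_runs;              /* how many softirq passes have run */

/* A softirq pass, reduced to its bookkeeping. */
static void run_softirqs(void)
{
    preempt_cnt += SOFTIRQ_OFFSET;    /* as __local_bh_disable does */
    pending = 0;
    softirq_runs++;
    preempt_cnt -= SOFTIRQ_OFFSET;    /* as _local_bh_enable does */
}

/* The check quoted at the start of the chapter. */
static void maybe_run_softirqs(void)
{
    if (!(preempt_cnt & IRQ_MASK) && pending)
        run_softirqs();
}

static void model_local_bh_disable(void)
{
    preempt_cnt += SOFTIRQ_OFFSET;
}

/* Undo the disable, then run whatever became pending in the meantime. */
static void model_local_bh_enable(void)
{
    preempt_cnt -= SOFTIRQ_OFFSET;
    maybe_run_softirqs();
}

/* A hardware interrupt whose handler raises softirq number nr. */
static void model_hardirq(unsigned int nr)
{
    preempt_cnt += HARDIRQ_OFFSET;    /* irq_enter */
    pending |= 1u << nr;              /* the handler raises a softirq */
    preempt_cnt -= HARDIRQ_OFFSET;    /* irq_exit ... */
    maybe_run_softirqs();             /* ... followed by the pending check */
}
```

Calling model_local_bh_disable, then model_hardirq(1), leaves the softirq pending but not run; the following model_local_bh_enable is what finally runs it, which is the extra check that distinguishes local_bh_enable from _local_bh_enable.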

1.1.6 local_softirq_pending

In its multi-CPU build, the Linux kernel defines an irq_cpustat_t structure for each CPU to hold some irq-related information about that CPU:

typedef struct {

unsigned int __softirq_pending;

unsigned int local_timer_irqs; /* used only on SMP; we can ignore it */

} ____cacheline_aligned irq_cpustat_t;

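The member we care about is the first one, __softirq_pending, which is 32 bits long. Each bit corresponds to one soft irq: 1 means that soft irq is waiting to be handled, 0 means it is not.

The kernel supports at most 32 soft irqs, but only the first six are actually implemented at present. They are all defined in kernel/softirq.c: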

static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp;

struct softirq_action is defined in include/linux/interrupt.h:

struct softirq_action

{

void (*action)(struct softirq_action *);

void *data;

};

The priority of these 32 soft irqs corresponds to their index in softirq_vec[32], with 0 the highest. In practice, __do_softirq simply walks from softirq_vec[0] to softirq_vec[31], checking and running each one in turn.

1.2 Softirqs from start to finish

invoke_softirq is actually a macro; on ARM Linux it is simply __do_softirq.

asmlinkage void __do_softirq(void) mainly does the following:

1. __local_bh_disable;

2. set_softirq_pending clears __softirq_pending;

3. local_irq_enable turns the ARM global interrupt back on. From item 4 of section 5.3.2.3 we know that the ARM global interrupt is off when execution reaches this point.

4. Scan softirq_vec and, for each pending entry, run h->action(h);

5. local_irq_disable turns the ARM global interrupt off.

6. For performance reasons, check again whether new soft irqs were raised while step 4 was running; if so, wakeup_softirqd starts the kernel thread ksoftirqd to run them.

7. _local_bh_enable.

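The next question is how the actions in softirq_vec get set up; in other words, if we as designers want to use a softirq, how do we initialize it and how do we implement it?

Discussion:

2. ksoftirqd is a kernel thread, and it too runs these soft irqs through do_softirq. Why is it called a thread?

3. From the above we can see that softirqs do not nest.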

1.2.1 Setting up action and data in softirq_vec

The Linux kernel sets up action and data in softirq_vec through open_softirq, whose implementation is very simple:

void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)

{

softirq_vec[nr].data = data;

softirq_vec[nr].action = action;

}

raise_softirq then activates a softirq by setting the corresponding bit in __softirq_pending. Its implementation is also very simple:

raise_softirq => raise_softirq_irqoff => __raise_softirq_irqoff => do { or_softirq_pending(1UL << (nr)); } while (0).

After these two steps, __do_softirq will run this softirq.
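The register, raise and run sequence can be modelled in user space as follows. This is a simplified sketch that follows the step list in 1.2 as written (in particular, it sets a flag where the text says ksoftirqd is woken) and leaves out interrupt masking entirely. The struct and every model_ function are hypothetical stand-ins for softirq_action, open_softirq, raise_softirq and __do_softirq.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SOFTIRQS 32

struct softirq_action_model {
    void (*action)(struct softirq_action_model *);
    void *data;
};

static struct softirq_action_model vec[NR_SOFTIRQS];  /* stands in for softirq_vec */
static unsigned int pending_mask;                     /* stands in for __softirq_pending */
static int ksoftirqd_woken;                           /* set where wakeup_softirqd would run */

static void model_open_softirq(int nr,
                               void (*action)(struct softirq_action_model *),
                               void *data)
{
    vec[nr].data = data;
    vec[nr].action = action;
}

static void model_raise_softirq(int nr)
{
    pending_mask |= 1u << nr;
}

static void model_do_softirq(void)
{
    unsigned int snapshot = pending_mask;   /* read the pending bits ... */
    struct softirq_action_model *h = vec;

    pending_mask = 0;                       /* ... and clear them (step 2) */
    for (; snapshot; snapshot >>= 1, h++)   /* step 4: index 0 is served first */
        if ((snapshot & 1u) && h->action)
            h->action(h);
    if (pending_mask)                       /* step 6: raised again while running */
        ksoftirqd_woken = 1;
}

/* Example handlers: record the order in which vectors run. */
static int run_order[8];
static int run_count;

static void record_index(struct softirq_action_model *h)
{
    if (run_count < 8)
        run_order[run_count++] = (int)(h - vec);
}

static void record_and_reraise(struct softirq_action_model *h)
{
    record_index(h);
    model_raise_softirq((int)(h - vec));    /* a handler may raise itself again */
}
```

If vector 5 is registered with record_index and vector 1 with record_and_reraise, and both are raised in that order, one call to model_do_softirq still runs vector 1 before vector 5 because the scan goes by index, and the re-raised bit is left pending for a later pass instead of being handled immediately.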

Here is an example we met in A Brief Walkthrough of the Linux Kernel Time Subsystem when analyzing the time subsystem:

1. First, in init_timers => open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);

2. tick_handle_periodic=> tick_periodic=> update_process_times=> run_local_timers=>raise_softirq(TIMER_SOFTIRQ);

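With this example in mind, implementing a softirq of our own is clearly straightforward. In general, though, we advise against using softirqs directly: if a device driver needs a bottom half, use tasklets. Tasklets are themselves built on softirqs and occupy two of the 32 softirq vectors: HI_SOFTIRQ (index 0, the highest-priority softirq, higher even than the timer softirq above) and TASKLET_SOFTIRQ (index 5).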

1.3 Tasklets from start to finish

The Linux kernel implements two kinds of tasklet that differ in priority and are otherwise very similar. They correspond to the softirq vector indexes HI_SOFTIRQ and TASKLET_SOFTIRQ. Below we analyze the low-priority tasklet directly.

Multiple tasklets can be set up under the TASKLET_SOFTIRQ softirq; they are stored in tasklet_vec. static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec) = { NULL }; expands to:

struct tasklet_head per_cpu__tasklet_vec;

Note: from include/asm-generic/percpu.h:

#define DEFINE_PER_CPU(type, name) __typeof__(type) per_cpu__##name

So tasklet_vec is a list containing multiple tasklets, each with the following structure:

struct tasklet_struct

{

struct tasklet_struct *next; // next tasklet in the tasklet_vec list

unsigned long state; // status. TASKLET_STATE_SCHED: already in tasklet_vec and waiting to run. TASKLET_STATE_RUN: currently running; used only on SMP. On UP, softirqs do not nest, so testing whether the tasklet is running is meaningless.

atomic_t count; // lock count; tasklet_enable and tasklet_disable operate on this

void (*func)(unsigned long); // the function this tasklet runs

unsigned long data; // argument passed to the tasklet function

};

1.3.1 Initializing the TASKLET_SOFTIRQ softirq

void __init softirq_init(void)

{

open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);

open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);

}

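1.3.2 Tasklet operations

LDD already explains these clearly, so we go straight to LDD section 7.5. They are all fairly simple, and we can analyze them directly against the kernel source code.

1.3.3 How tasklets get executed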

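Continuing from step 4 in 1.2: for the TASKLET_SOFTIRQ softirq the action is tasklet_action. All it does is scan the whole tasklet_vec list, find each tasklet that is not locked (not disabled by tasklet_disable) and whose state is TASKLET_STATE_SCHED, and run its function.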
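The scan that tasklet_action performs can be illustrated with a single-CPU user-space model. It is a sketch under assumptions rather than the kernel code: a plain int replaces the atomic count, there is no TASKLET_STATE_RUN handling because only one CPU is modelled, and the names tasklet_model, model_tasklet_schedule and model_tasklet_action are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STATE_SCHED 1ul   /* stands in for the TASKLET_STATE_SCHED bit */

struct tasklet_model {
    struct tasklet_model *next;
    unsigned long state;
    int count;                      /* 0 = enabled, nonzero = disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

static struct tasklet_model *tasklet_list;   /* stands in for this CPU's tasklet_vec */

/* Queue a tasklet once; scheduling it again while queued does nothing. */
static void model_tasklet_schedule(struct tasklet_model *t)
{
    if (t->state & STATE_SCHED)
        return;
    t->state |= STATE_SCHED;
    t->next = tasklet_list;
    tasklet_list = t;
}

/* Run every queued tasklet that is enabled; keep disabled ones queued. */
static void model_tasklet_action(void)
{
    struct tasklet_model *list = tasklet_list;   /* detach the whole list */

    tasklet_list = NULL;
    while (list) {
        struct tasklet_model *t = list;

        list = list->next;
        if (t->count == 0) {
            t->state &= ~STATE_SCHED;   /* may be scheduled again from func */
            t->func(t->data);
        } else {
            t->next = tasklet_list;     /* still disabled: try again next pass */
            tasklet_list = t;
        }
    }
}

/* Example tasklet function: accumulate the data argument. */
static unsigned long sum;

static void add_to_sum(unsigned long data)
{
    sum += data;
}
```

With one enabled and one disabled tasklet queued, a pass runs only the enabled one and leaves the other on the list; once its count drops back to zero, the next pass runs it. The design point is that disabling a tasklet never loses a scheduled run, it only postpones it.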

Discussion:

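4. In 1.1 we said that when a process is in kernel mode its stack is only 8K bytes - sizeof(struct thread_info). Can that stack overflow? Possible causes include:

a. Defining a very large structure or array inside some driver function, e.g. int a[100000];. This is different from writing an AP (a user-space application), and it is a common beginner mistake. Many experienced developers know about the limit but not why it exists. In fact an AP has a limit too; it is just much larger and usually not exceeded, although people have hit it. If you are interested, see man getrlimit.

b. Interrupts nested too deeply (the level of nesting interrupt): each newly entered interrupt has to save its interrupt context on the kernel-mode stack. In principle this cannot happen if we specify IRQF_DISABLED.

c. Recursive functions.

d. Others.

5. Can an interrupt handler or softirq go to sleep, that is, call schedule? The answer is no. Why did the designers choose this? I do not think the reason is technical difficulty; technically it could be done. My own understanding is as follows:

a. First, when an interrupt occurs, the current process may have nothing to do with it. If the interrupt handler slept, the current process would be forced to sleep because of something unrelated to it, which is neither fair nor reasonable. And if the current process is time-sensitive, say a user interface process, the user would be driven mad.

b. Interrupt handling usually has timing requirements; if it sleeps for any reason, its timing becomes uncontrollable.

So the Linux kernel implementers put a check in schedule => schedule_debug: scheduling is not allowed while in a hardware irq or soft irq. See: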

if (unlikely(in_atomic_preempt_off()) && unlikely(!prev->exit_state))

__schedule_bug(prev);

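So when designing an interrupt handler, softirq or tasklet, we must not use any kernel function that may cause a schedule: dynamic memory allocation without GFP_ATOMIC (e.g. kmalloc), wait_event, down (semaphore), wait_for_completion and so on. We will discuss these further in the next two chapters.

6. Can a softirq or hard irq access user-space data? No, and it would be meaningless, because when an interrupt occurs there is no way to know which process the current CPU is running.

7. If an interrupt handler has a race condition with other code (such as read or write), how should it be handled? Would the circular buffer we discussed earlier have a problem (the interrupt writes and read() reads)?

8. Interrupts can nest, but a softirq is never nested by another softirq. And because no schedule happens during either interrupt or softirq handling, we say that an interrupt handler or softirq handler runs in atomic context.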

1.4 Work queues

When we create a work queue, the Linux kernel actually creates a kernel thread. This kernel thread keeps checking whether the work queue has any work to execute; if so it runs the work, otherwise it sleeps.

1.4.1 Work queue creating

There are usually three ways to create a work queue:

1. #define create_workqueue(name) __create_workqueue((name), 0, 0)

2. #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1)

3. #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0)

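Option 1 creates one work queue per CPU, while option 3 creates only one even on SMP. On a single-CPU architecture such as our U3 or No1 the two are the same, but for future extensibility we had better just use option 3.

Option 2 creates a freezeable work queue, meaning that the kernel thread behind the work queue (whose function is worker_thread) can be frozen. This is a power management concept that we will analyze later. For now it is enough to know that before entering deep sleep, power management uses try_to_freeze_tasks to freeze this kernel thread; once the kernel thread is next woken up to run, worker_thread => try_to_freeze => refrigerator puts it into the frozen state. (The naming here is quite vivid.) Put simply, it tells those processes and kernel threads to stop doing anything because the system is about to enter deep sleep, so that they do not interfere.

At this point the work queue has no work at all; we need to add work with queue_work for it to execute.

1.4.2 Adding a work item with queue_work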

queue_work=>__queue_work=>insert_work=>list_add(&work->entry, &cwq->worklist);

There is another variant, queue_delayed_work, which adds the work to the work queue only after a delay (in effect by starting a timer).
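The queue-then-drain behaviour described in 1.4 can be sketched in user space without threads. This is a deliberately simplified model and not the kernel implementation: one function call stands in for one wakeup of the worker thread, a singly linked FIFO replaces the kernel list, and the names work_model, workqueue_model, model_queue_work and model_run_workqueue are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct work_model {
    void (*func)(struct work_model *);
    struct work_model *next;
    int pending;                    /* nonzero while the item sits on a queue */
};

struct workqueue_model {
    struct work_model *head;
    struct work_model *tail;
};

/* Append a work item; returns 1 if queued, 0 if it was already pending. */
static int model_queue_work(struct workqueue_model *wq, struct work_model *w)
{
    if (w->pending)
        return 0;
    w->pending = 1;
    w->next = NULL;
    if (wq->tail)
        wq->tail->next = w;
    else
        wq->head = w;
    wq->tail = w;
    return 1;
}

/* One wakeup of the worker: run everything queued, then return (go to sleep).
 * Returns the number of work items that were run. */
static int model_run_workqueue(struct workqueue_model *wq)
{
    int ran = 0;

    while (wq->head) {
        struct work_model *w = wq->head;

        wq->head = w->next;
        if (!wq->head)
            wq->tail = NULL;
        w->pending = 0;             /* cleared first so func may queue w again */
        w->func(w);
        ran++;
    }
    return ran;
}

/* Example work function: count how many times work has run. */
static int calls;

static void count_call(struct work_model *w)
{
    (void)w;
    calls++;
}
```

Queueing the same item twice before the worker runs results in a single execution, because the pending flag rejects the second request. Since the work function runs in the worker's context rather than in interrupt context, it is the place where sleeping operations belong.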

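This is simple enough that we will not go into detail; we can read the code together.

For the other interfaces, see LDD section 7.6.

1.4.3 Some differences between work queues and softirqs

1. A work queue is run by a kernel thread, so work queue code is free to use operations that may cause a schedule (sleep).

2. A softirq can usually be executed sooner, at the latest when the next tick arrives, whereas a work queue has to wait at least until its kernel thread is scheduled.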