Linux Kernel Preemption (preempt)

Source: Internet. Published by: 手机网络gsm cdma lte. Editor: 程序博客网. Time: 2024/06/05 20:47

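Early Linux kernels were not preemptible. Scheduling worked as follows: a process could voluntarily start a reschedule by calling schedule(). Involuntary, forced rescheduling could happen only just before returning from a system call, or just before returning to user space from an interrupt or exception handler. An interrupt or exception taken while already in kernel space did not trigger scheduling. This kept the kernel implementation simple, but it caused two problems:
The first is scheduling latency. If an interrupt occurs while the CPU is in the kernel, returning from that interrupt does not trigger a reschedule. Scheduling happens only when the system call or interrupt (exception) that originally took the CPU from user space into kernel space returns.
The second is priority inversion. In Linux, anything running in kernel mode takes precedence over user-mode processes, which can invert priorities. For example, a low-priority user process that is busy in the kernel, servicing soft or hard interrupts and the like, can keep a high-priority task from getting a timely response.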
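Current Linux kernels add kernel preemption (preempt). Kernel preemption means that a process can be preempted while it is executing a system call: it is suspended temporarily so that a newly woken, higher-priority process can run. Preemption is not safe at every point in the kernel; code inside a critical section, for example, must not be preempted. A critical section is a sequence of instructions that no more than one process may execute at a time. In the Linux kernel these sections are protected with spinlocks.
Kernel preemption requires that every variable and data structure that may be shared by more than one process be protected by a mutual-exclusion mechanism, that is, placed inside a critical section. A preemptible kernel considers a process switch "safe" whenever the kernel is not in an interrupt handler and not in critical code protected by a spinlock or a similar mutual-exclusion mechanism.
The Linux kernel protects its critical code with mutual-exclusion mechanisms and also inserts scheduling checkpoints into code paths that run for too long, breaking them up. Tasks can then be switched quickly, and this groundwork also prepared the kernel for preemption.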

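The kernel can be preempted only while it is executing an exception handler (usually a system call) and kernel preemption is enabled.

Kernel preemption also happens explicitly when a process in the kernel blocks or calls schedule() directly. This form has always been supported (it is really a voluntary yield of the CPU), because no extra logic is needed to make sure the kernel can be preempted safely: code that calls schedule() explicitly is expected to know that it is safe to be preempted.

Kernel preemption can occur:

When an interrupt handler finishes and is about to return to kernel space.

When kernel code becomes preemptible again, for example on releasing a lock or re-enabling softirqs.

When a task in the kernel explicitly calls schedule().

When a task in the kernel blocks (which also results in a call to schedule()).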

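Kernel preemption is disabled in the following cases:
(1) While the kernel is executing an interrupt handler, preemption is not allowed; it is performed when the interrupt returns.
(2) While the kernel is executing a softirq or tasklet, preemption is disabled; it is performed when the softirq returns.
(3) Preemption is disabled inside critical sections. The functions that protect critical sections control preemption through the preempt-count macros; a count greater than 0 means that kernel preemption is disabled.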
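
The three cases above can be folded into a single counter. The sketch below is a user-space model, not kernel code. It assumes a 2.6-era layout in which the low 8 bits of the count hold the preempt-disable depth, the next 8 bits the softirq depth, and the following 12 bits the hardirq depth; the real widths depend on kernel version and architecture, and every `model_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed field widths (modelled on 2.6-era kernels; they vary). */
#define PREEMPT_BITS   8
#define SOFTIRQ_BITS   8
#define HARDIRQ_BITS   12

#define PREEMPT_SHIFT  0
#define SOFTIRQ_SHIFT  (PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)

#define PREEMPT_OFFSET (1U << PREEMPT_SHIFT)
#define SOFTIRQ_OFFSET (1U << SOFTIRQ_SHIFT)
#define HARDIRQ_OFFSET (1U << HARDIRQ_SHIFT)

#define PREEMPT_MASK   (((1U << PREEMPT_BITS) - 1) << PREEMPT_SHIFT)
#define SOFTIRQ_MASK   (((1U << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK   (((1U << HARDIRQ_BITS) - 1) << HARDIRQ_SHIFT)

/* Hypothetical stand-in for current_thread_info()->preempt_count. */
static unsigned int model_preempt_count;

/* Case (1): hard interrupt handler entry and exit. */
static void model_irq_enter(void) { model_preempt_count += HARDIRQ_OFFSET; }
static void model_irq_exit(void)
{
    assert(model_preempt_count & HARDIRQ_MASK);   /* must be balanced */
    model_preempt_count -= HARDIRQ_OFFSET;
}

/* Case (2): softirq or tasklet processing. */
static void model_softirq_enter(void) { model_preempt_count += SOFTIRQ_OFFSET; }
static void model_softirq_exit(void)
{
    assert(model_preempt_count & SOFTIRQ_MASK);
    model_preempt_count -= SOFTIRQ_OFFSET;
}

/* Case (3): critical sections. */
static void model_preempt_disable(void) { model_preempt_count += PREEMPT_OFFSET; }
static void model_preempt_enable(void)
{
    assert(model_preempt_count & PREEMPT_MASK);
    model_preempt_count -= PREEMPT_OFFSET;
}

/* True while a hard interrupt handler or a softirq is being processed. */
static bool model_in_interrupt(void)
{
    return (model_preempt_count & (HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* Preemption is allowed only when every field is zero. */
static bool model_preemptible(void)
{
    return model_preempt_count == 0;
}
```

Because all three depths share one word, a single comparison against zero answers the question "may the kernel be preempted right now?".
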
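The preemptible kernel works as follows: when a spinlock is released or an interrupt returns, a preemptive reschedule is performed if need_resched is set for the currently running process.
The kernel adds a preempt_count member to the thread-info structure to act as the kernel-preemption lock. A value of 0 means the kernel may be preempted; the count is raised and lowered as spinlocks, rwlocks and similar locks are taken and released. The thread_info structure is listed below (from include/asm-x86/thread_info.h):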

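struct thread_info
{
    struct task_struct *task;             /* main task structure */
    struct exec_domain *exec_domain;      /* execution domain */
    __u32 flags;                          /* low-level flags, e.g. TIF_NEED_RESCHED */
    __u32 status;                         /* thread-synchronous flags */
    __u32 cpu;                            /* current CPU */
    int preempt_count;                    /* 0 means the kernel may be preempted */
    mm_segment_t addr_limit;              /* thread address space limit */
    struct restart_block restart_block;
#ifdef CONFIG_IA32_EMULATION
    void __user *sysenter_return;
#endif
};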

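The entry point for preemptive scheduling is preempt_schedule(). As described here, it marks the current process as TASK_PREEMPTED and then calls schedule(); in that state schedule() does not remove the process from the run queue. (In the 2.6 code listed later in this article, the same effect is obtained by adding PREEMPT_ACTIVE to preempt_count around the call to schedule().)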
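Kernel preemption API functions
In interrupt or critical-section code a thread must disable kernel preemption. Mutual-exclusion mechanisms (such as spinlocks and RCU), interrupt code, linked-list traversal and the like therefore disable preemption, and re-enable it once the critical code has finished. Preemption is disabled and enabled with the API functions preempt_disable and preempt_enable.
The kernel preemption API functions are described below (from include/linux/preempt.h):
preempt_enable()             // decrement preempt_count by 1 and reschedule if necessary
preempt_disable()            // increment preempt_count by 1
preempt_enable_no_resched()  // decrement preempt_count by 1 without rescheduling immediately
preempt_check_resched()      // reschedule if necessary
preempt_count()              // return the preemption count
preempt_schedule()           // scheduler entry point for kernel preemption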
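
To make the nesting behaviour of these calls concrete, here is a small user-space model, assuming a single thread and plain variables in place of the per-thread kernel state. The `model_` functions are hypothetical stand-ins that only mimic the counting; the depth check inside `model_preempt_check_resched()` is folded in from what the real preempt_schedule() is assumed to do.

```c
#include <assert.h>
#include <stdbool.h>

static int  preempt_depth;   /* models preempt_count */
static bool need_resched;    /* models the TIF_NEED_RESCHED flag */
static int  resched_calls;   /* how often the model "scheduler" ran */

/* Stand-in for preempt_schedule(): just record that it ran. */
static void model_preempt_schedule(void)
{
    need_resched = false;
    resched_calls++;
}

static void model_preempt_disable(void)
{
    preempt_depth++;
}

/* Drop one level of nesting but never reschedule here. */
static void model_preempt_enable_no_resched(void)
{
    assert(preempt_depth > 0);   /* unbalanced enable would be a bug */
    preempt_depth--;
}

/* Reschedule only if one is pending and no disable is outstanding. */
static void model_preempt_check_resched(void)
{
    if (preempt_depth == 0 && need_resched)
        model_preempt_schedule();
}

static void model_preempt_enable(void)
{
    model_preempt_enable_no_resched();
    model_preempt_check_resched();
}
```

With two nested `model_preempt_disable()` calls, a reschedule requested in between is deferred until the second `model_preempt_enable()` brings the depth back to zero.
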

The macros that implement the API are listed below (from include/linux/preempt.h):

#define preempt_disable() \
do { \
    inc_preempt_count(); \
    barrier(); /* compiler barrier: stops gcc from reordering memory accesses across this point */ \
} while (0)

#define inc_preempt_count() \
do { \
    preempt_count()++; \
} while (0)

#define preempt_count() (current_thread_info()->preempt_count)

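Kernel preemptive scheduling
The kernel checks for, and performs, preemptive scheduling when a hard or soft interrupt returns. The two cases are described below.
(1) Preemptive scheduling on return from a hard interrupt
On exit from a hard interrupt or fault the kernel executes retint_kernel, which invokes the preemption function. retint_kernel is listed below (from arch/x86/entry_64.S):

#ifdef CONFIG_PREEMPT
ENTRY(retint_kernel)
    cmpl $0,threadinfo_preempt_count(%rcx)          /* preempt count non-zero: do not preempt */
    jnz  retint_restore_args
    bt   $TIF_NEED_RESCHED,threadinfo_flags(%rcx)   /* is a reschedule pending? */
    jnc  retint_restore_args
    bt   $9,EFLAGS-ARGOFFSET(%rsp)                  /* were interrupts enabled (IF, bit 9)? */
    jnc  retint_restore_args
    call preempt_schedule_irq
    jmp  exit_intr
#endif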
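
The three tests made by retint_kernel can be restated in C. This is only a sketch of the decision, assuming a hypothetical `model_irq_frame` structure that gathers the values the assembly reads from thread_info and from the saved register frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define X86_EFLAGS_IF_BIT 9   /* interrupt-enable flag, the bit tested by "bt $9" */

/* Hypothetical bundle of the state retint_kernel inspects. */
struct model_irq_frame {
    int      preempt_count;   /* preempt count of the interrupted thread */
    bool     need_resched;    /* TIF_NEED_RESCHED */
    uint64_t saved_eflags;    /* EFLAGS saved when the interrupt was taken */
};

/* Returns true when the return path would call preempt_schedule_irq. */
static bool model_should_preempt_on_irq_return(const struct model_irq_frame *f)
{
    assert(f != NULL);
    if (f->preempt_count != 0)          /* cmpl $0 / jnz */
        return false;
    if (!f->need_resched)               /* bt $TIF_NEED_RESCHED / jnc */
        return false;
    if (!((f->saved_eflags >> X86_EFLAGS_IF_BIT) & 1))   /* bt $9 / jnc */
        return false;
    return true;
}
```

The last test matters because an interrupted context that had interrupts disabled was itself in a section that must not be rescheduled.
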

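preempt_schedule_irq is the entry point for kernel preemption when leaving interrupt context. It must be called, and it returns, with interrupts disabled, which protects it against recursive invocation from interrupts. The function is listed below (from kernel/sched.c):

asmlinkage void __sched preempt_schedule_irq(void)
{
    struct thread_info *ti = current_thread_info();

    /* The preempt count must be zero and interrupts must be off on entry. */
    BUG_ON(ti->preempt_count || !irqs_disabled());

    do {
        add_preempt_count(PREEMPT_ACTIVE);
        local_irq_enable();
        schedule();
        local_irq_disable();
        sub_preempt_count(PREEMPT_ACTIVE);

        /* Re-read the flag in case another reschedule was requested meanwhile. */
        barrier();
    } while (unlikely(test_thread_flag(TIF_NEED_RESCHED)));
}

schedule() checks whether the process's preempt count is very large, that is, whether PREEMPT_ACTIVE is set, so that an ordinary call to schedule() is not treated as a kernel preemption and vice versa.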
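
The role of PREEMPT_ACTIVE can be sketched in a few lines. The model below assumes, as in 2.6-era kernels, that schedule() leaves a task on the run queue when PREEMPT_ACTIVE is set even if the task has already changed its state to a sleeping one. The constant's value is architecture-specific and the `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PREEMPT_ACTIVE 0x10000000   /* assumed value; arch-specific */

enum model_state { MODEL_RUNNING = 0, MODEL_INTERRUPTIBLE = 1 };

struct model_task {
    enum model_state state;
    int              preempt_count;
    bool             on_runqueue;
};

/* Only the dequeue decision of schedule() is modelled here. */
static void model_schedule(struct model_task *prev)
{
    if (prev->state != MODEL_RUNNING &&
        !(prev->preempt_count & MODEL_PREEMPT_ACTIVE))
        prev->on_runqueue = false;   /* the task really wants to sleep */
    /* ... picking and switching to the next task is omitted ... */
}

/* Preemption path: flag the count so schedule() keeps the task queued. */
static void model_preempt_schedule_irq(struct model_task *cur)
{
    assert(cur->preempt_count == 0);   /* mirrors the BUG_ON in the listing */
    cur->preempt_count += MODEL_PREEMPT_ACTIVE;
    model_schedule(cur);
    cur->preempt_count -= MODEL_PREEMPT_ACTIVE;
}
```

A task preempted just after setting its state to MODEL_INTERRUPTIBLE, but before it calls the scheduler itself, therefore stays runnable and can finish going to sleep properly once it runs again.
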
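(2) Preemptive scheduling on return from a softirq
pagefault_enable, which re-enables page-fault handling, and local_bh_enable, which re-enables softirq bottom halves, both call preempt_check_resched to see whether a kernel preemption is needed. Preemption is actually carried out only when a reschedule is pending and scheduling is currently permitted. preempt_check_resched is listed below:

#define preempt_check_resched() \
do { \
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
        preempt_schedule(); \
} while (0)

The source of preempt_schedule is essentially the same as that of preempt_schedule_irq: it reschedules the process, so it is not analysed again here.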
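
For readers who want to see the shape of preempt_schedule without the kernel source at hand, here is a user-space model. It assumes, as in 2.6-era kernels, that the function returns immediately when the preempt count is non-zero or interrupts are disabled, and otherwise loops like preempt_schedule_irq. All `model_` names, and the `model_pending_wakeups` counter used to simulate wakeups arriving during a switch, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PREEMPT_ACTIVE 0x10000000   /* assumed value; arch-specific */

static int  model_count;            /* models preempt_count */
static bool model_irqs_off;         /* models irqs_disabled() */
static bool model_need_resched;     /* models TIF_NEED_RESCHED */
static int  model_switches;         /* how many times the scheduler ran */
static int  model_pending_wakeups;  /* wakeups that arrive during schedule() */

static void model_schedule(void)
{
    /* In this model the scheduler is only reached via preemption. */
    assert(model_count & MODEL_PREEMPT_ACTIVE);
    model_switches++;
    model_need_resched = false;
    if (model_pending_wakeups > 0) {   /* a new request arrived meanwhile */
        model_pending_wakeups--;
        model_need_resched = true;
    }
}

static void model_preempt_schedule(void)
{
    /* Not safe to preempt: inside a critical section or IRQs are off. */
    if (model_count != 0 || model_irqs_off)
        return;

    do {
        model_count += MODEL_PREEMPT_ACTIVE;
        model_schedule();
        model_count -= MODEL_PREEMPT_ACTIVE;
    } while (model_need_resched);      /* repeat if another request came in */
}
```

The early return is what makes it safe for preempt_check_resched to test only the flag: the count is re-checked inside the callee.
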

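How the preemptible kernel is supported
The preemptible Linux kernel required two main changes. The first is in the interrupt entry and return code. On interrupt entry the kernel-preemption lock preempt_count is incremented to disable preemption; on interrupt return it is decremented, making preemption possible again.

We say that the preemptible kernel can be preempted at any point mainly because an interrupt can occur at any point, and every time the preemptible kernel finishes handling an interrupt and returns, it checks whether the kernel can be preempted. If the kernel's current state allows it, the kernel reschedules and picks a high-priority process to run. A non-preemptible kernel behaves differently: on return from a hardware interrupt it reschedules only if the interrupted process was in user mode. If the interrupted process was in kernel mode, no scheduling takes place and the interrupted process simply resumes.

The other basic change is a redefinition of spinlocks and read/write locks so that lock operations also manipulate preempt_count. Taking one of these locks increments preempt_count to disable preemption; releasing it decrements preempt_count and, if the preemption conditions are met and a reschedule is needed, performs a preemptive reschedule. spin_lock() and spin_unlock() illustrate this: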

/////////////////////////////////////////////////////////////////////////

/linux+v2.6.19/kernel/spinlock.c

void __lockfunc _spin_unlock(spinlock_t *lock)
{
    spin_release(&lock->dep_map, 1, _RET_IP_);
    _raw_spin_unlock(lock);
    preempt_enable();
}
EXPORT_SYMBOL(_spin_unlock);

void __lockfunc _spin_lock(spinlock_t *lock)
{
    preempt_disable();
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    _raw_spin_lock(lock);
}
EXPORT_SYMBOL(_spin_lock);

/////////////////////////////////////////////////////////////////////////

#define preempt_disable() \
do { \
    inc_preempt_count(); \
    barrier(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
    barrier(); \
    dec_preempt_count(); \
} while (0)

#define preempt_check_resched() \
do { \
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
        preempt_schedule(); \
} while (0)

#define preempt_enable() \
do { \
    preempt_enable_no_resched(); \
    barrier(); \
    preempt_check_resched(); \
} while (0)
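
The ordering in the listing (disable preemption, then take the lock; release the lock, then enable preemption) can be tried out in user space. The sketch below assumes C11 atomics for the lock word and a single global counter in place of the per-thread preempt_count, so it models one thread only; `model_spinlock` and the other `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_spinlock {
    atomic_flag locked;
};

static int  model_preempt_count;   /* per-thread in the kernel; global here */
static bool model_need_resched;
static int  model_preemptions;     /* times the model "preempted" the holder */

static void model_spin_lock(struct model_spinlock *l)
{
    model_preempt_count++;         /* preempt_disable() comes first */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;                          /* spin until the lock is free */
}

static void model_spin_unlock(struct model_spinlock *l)
{
    assert(model_preempt_count > 0);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
    model_preempt_count--;         /* preempt_enable(): drop the count ... */
    if (model_preempt_count == 0 && model_need_resched) {
        model_need_resched = false;    /* ... then preempt if pending */
        model_preemptions++;
    }
}
```

Disabling preemption before acquiring means a lock holder is never switched out while other CPUs spin on the lock, and with nested locks the pending reschedule is taken only when the outermost lock is released.
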

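Another way to implement a preemptible kernel is to insert preemption points into kernel code. This approach first identifies the kernel code sections that cause long latencies and then inserts preemption points at suitable places in them, so that the system can reschedule without waiting for the section to finish. Events that need a fast response can then have their service process scheduled onto the CPU as soon as possible. A preemption point is simply a call to the scheduler function:

if (current->need_resched) schedule();

Such code sections are typically loops. The preemption-point approach repeatedly checks need_resched inside the loop and, when necessary, calls schedule() to make the current process give up the CPU.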
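
A preemption point inside a long loop might look like the following user-space sketch. The reschedule request is simulated by a hypothetical `tick_every` parameter standing in for a timer interrupt, and `model_schedule()` only counts how often the loop yielded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool model_need_resched;   /* models current->need_resched */
static int  model_yields;         /* how often the loop gave up the CPU */

/* Stand-in for schedule(): record the yield and clear the request. */
static void model_schedule(void)
{
    model_need_resched = false;
    model_yields++;
}

/* The preemption point: yield only if a reschedule has been requested. */
static void model_preemption_point(void)
{
    if (model_need_resched)
        model_schedule();
}

/* A long-running loop with a preemption point in every iteration. */
static long model_sum_with_preemption_points(const int *v, size_t n,
                                             size_t tick_every)
{
    long sum = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
        /* Simulated timer: request a reschedule every tick_every items. */
        if (tick_every != 0 && (i + 1) % tick_every == 0)
            model_need_resched = true;
        model_preemption_point();
    }
    return sum;
}
```

The check is cheap when no reschedule is pending, which is why it can sit inside the loop body without noticeably slowing the common case.
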

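When rescheduling is needed

The kernel has to know when to call schedule(). If it relied solely on code calling schedule() explicitly, a process could keep running forever. Instead, the kernel provides a need_resched flag that indicates whether a reschedule is required. scheduler_tick() sets the flag when a process uses up its time slice, and try_to_wake_up() sets it when a higher-priority process becomes runnable.

set_tsk_need_resched: sets the need_resched flag of the given process

clear_tsk_need_resched: clears the need_resched flag of the given process

need_resched(): tests the need_resched flag; returns true if it is set and false otherwise

Semaphores, wait queues, completions and similar mechanisms all wake sleepers through a waitqueue. The waitqueue's wake function is default_wake_function, which calls try_to_wake_up to make the process runnable and set the reschedule flag.

The kernel also checks need_resched when returning to user space and when returning from an interrupt. If the flag is set, the kernel calls the scheduler before continuing.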
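
The interplay between the timer tick and the return path can be modelled as below. This is a deliberately simplified sketch: the time slice is a plain tick counter with an arbitrary length, the refill policy is an assumption of the model, and the `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEFAULT_SLICE 3   /* arbitrary slice length, in ticks */

struct model_task {
    int  time_slice;     /* ticks left in the current slice */
    bool need_resched;   /* models the need_resched flag */
};

/* Called once per timer tick for the running task. */
static void model_scheduler_tick(struct model_task *cur)
{
    assert(cur != NULL);
    if (cur->time_slice > 0 && --cur->time_slice == 0) {
        cur->need_resched = true;                /* slice used up */
        cur->time_slice = MODEL_DEFAULT_SLICE;   /* refill for the next turn */
    }
}

/* Called on the way back to user space or out of an interrupt. */
static bool model_return_path_should_schedule(struct model_task *cur)
{
    assert(cur != NULL);
    if (!cur->need_resched)
        return false;
    cur->need_resched = false;   /* the scheduler is about to run */
    return true;
}
```

The tick only records that a switch is wanted; the switch itself happens later, at the next point where the kernel checks the flag.
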

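Each process has its own need_resched flag because accessing a value in the process descriptor is faster than accessing a global variable (the current macro is fast and the descriptor is usually in cache). Before 2.2 the flag was a global variable. From 2.2 through 2.4 it lived in task_struct. In 2.6 it moved into the thread_info structure, where it is a single bit of a dedicated flags word. As this shows, the kernel developers keep refining things.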

/linux+v2.6.19/include/linux/sched.h

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
    set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
    clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline int signal_pending(struct task_struct *p)
{
    return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

static inline int need_resched(void)
{
    return unlikely(test_thread_flag(TIF_NEED_RESCHED));
}
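
Since these helpers all reduce to setting, clearing, or testing one bit in the flags word of thread_info, a plain-C model is short. The bit numbers below are assumptions (they differ between architectures), the operations are not atomic as the kernel's are, and the `model_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_TIF_SIGPENDING   2   /* assumed bit numbers; arch-specific */
#define MODEL_TIF_NEED_RESCHED 3
#define MODEL_FLAG_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

struct model_thread_info {
    unsigned long flags;   /* one bit per thread flag */
};

static void model_set_flag(struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < MODEL_FLAG_BITS);
    ti->flags |= 1UL << bit;
}

static void model_clear_flag(struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < MODEL_FLAG_BITS);
    ti->flags &= ~(1UL << bit);
}

static bool model_test_flag(const struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < MODEL_FLAG_BITS);
    return ((ti->flags >> bit) & 1UL) != 0;
}
```

Packing the flags into one word lets the low-level return paths test several conditions, such as a pending signal and a pending reschedule, with a single load.
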

///////////////////////////////////////////////////////////////////////////////

/linux+v2.6.19/kernel/sched.c

/*
 * resched_task - mark a task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
#ifdef CONFIG_SMP

#ifndef tsk_is_polling
#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
#endif

static void resched_task(struct task_struct *p)
{
    int cpu;

    assert_spin_locked(&task_rq(p)->lock);

    if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
        return;

    set_tsk_thread_flag(p, TIF_NEED_RESCHED);

    cpu = task_cpu(p);
    if (cpu == smp_processor_id())
        return;

    /* NEED_RESCHED must be visible before we test polling */
    smp_mb();
    if (!tsk_is_polling(p))
        smp_send_reschedule(cpu);
}
#else
static inline void resched_task(struct task_struct *p)
{
    assert_spin_locked(&task_rq(p)->lock);
    set_tsk_need_resched(p);
}
#endif
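
The branches of the SMP version of resched_task can be summarised as a small decision function. The sketch below is a user-space model that returns what would happen instead of doing it; the `model_action` enum and the other `model_` names are hypothetical, and the memory barrier is left out because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action {
    MODEL_ALREADY_SET,    /* flag was set before: nothing to do */
    MODEL_FLAG_ONLY,      /* flag set, no cross-CPU interrupt needed */
    MODEL_FLAG_AND_IPI    /* flag set and a reschedule IPI is sent */
};

struct model_task {
    bool need_resched;   /* TIF_NEED_RESCHED */
    bool polling;        /* TIF_POLLING_NRFLAG: the CPU polls the flag itself */
    int  cpu;            /* CPU the task runs on */
};

static enum model_action model_resched_task(struct model_task *p, int this_cpu)
{
    assert(p != NULL);
    if (p->need_resched)
        return MODEL_ALREADY_SET;

    p->need_resched = true;

    if (p->cpu == this_cpu)      /* local CPU checks the flag on its own */
        return MODEL_FLAG_ONLY;
    if (p->polling)              /* remote CPU is polling the flag */
        return MODEL_FLAG_ONLY;
    return MODEL_FLAG_AND_IPI;   /* remote CPU must be interrupted */
}
```

The cross-CPU interrupt is the expensive step, so it is sent only when the target CPU would otherwise not notice the flag.
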

/***
 * try_to_wake_up - wake up a thread
 * @p: the to-be-woken-up thread
 * @state: the mask of task states that can be woken
 * @sync: do a synchronous wakeup?
 *
 * Put it on the run-queue if it's not already there. The "current"
 * thread is always on the run-queue (except when the actual
 * re-schedule is in progress), and as such you're allowed to do
 * the simpler "current->state = TASK_RUNNING" to mark yourself
 * runnable without the overhead of this.
 *
 * returns failure only if the task is already active.
 */
static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync);

int fastcall wake_up_process(struct task_struct *p)
{
    return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
                             TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wake_up_process);

int fastcall wake_up_state(struct task_struct *p, unsigned int state)
{
    return try_to_wake_up(p, state, 0);
}

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags);

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync, void *key);

/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @q: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 */
void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, void *key)
{
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);
    __wake_up_common(q, mode, nr_exclusive, 0, key);
    spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);

int default_wake_function(wait_queue_t *curr, unsigned mode, int sync, void *key)
{
    return try_to_wake_up(curr->private, mode, sync);
}
EXPORT_SYMBOL(default_wake_function);

void fastcall complete(struct completion *x)
{
    unsigned long flags;

    spin_lock_irqsave(&x->wait.lock, flags);
    x->done++;
    __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,
                     1, 0, NULL);
    spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);
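
The comment on __wake_up_common describes how exclusive and non-exclusive waiters are treated. The following user-space sketch models that scan with a singly linked list and a per-waiter callback playing the role of default_wake_function. It is an illustration under those assumptions, not the kernel's implementation, and the `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_waiter;
typedef int (*model_wake_fn)(struct model_waiter *w);

struct model_waiter {
    bool                 exclusive;  /* wake-one waiter? */
    bool                 runnable;   /* already woken? */
    model_wake_fn        func;       /* per-waiter wake callback */
    struct model_waiter *next;
};

/* Plays the role of default_wake_function: 1 if the waiter was woken. */
static int model_default_wake(struct model_waiter *w)
{
    if (w->runnable)
        return 0;        /* already running: report failure */
    w->runnable = true;
    return 1;
}

/*
 * Wake every non-exclusive waiter and at most nr_exclusive exclusive
 * ones; nr_exclusive == 0 wakes everything. Returns the number woken.
 */
static int model_wake_up_common(struct model_waiter *head, int nr_exclusive)
{
    int woken = 0;

    for (struct model_waiter *w = head; w != NULL; w = w->next) {
        assert(w->func != NULL);
        if (w->func(w)) {
            woken++;
            if (w->exclusive && --nr_exclusive == 0)
                break;   /* exclusive quota used up: stop scanning */
        }
    }
    return woken;
}
```

A waiter whose callback reports failure does not count against the exclusive quota, so the scan simply continues, which matches the rare case the kernel comment mentions.
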
