The Linux process scheduler (__schedule) framework

Source: Internet. Editor: 程序博客网. Time: 2024/06/03 20:07

http://blog.csdn.net/lonewolfxw/article/details/7906903

1. Scheduler overview

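Modern computer systems run far more tasks than they have processor cores, so tasks must share the processors and their registers. To divide processor time fairly among tasks and create the illusion that programs run in parallel, the operating system kernel needs a process scheduler that distributes run time among processes as fairly as it can. Real-world constraints make the scheduler complicated: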

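Processor time must be divided among processes as fairly as possible.
More important processes should receive more processor time than less important ones, so scheduling must take priorities into account and treat processes differently.
Process switches must not be too frequent; otherwise processor efficiency drops because time is spent on switching itself.
The interval between two process switches must not be too long either; otherwise some processes respond slowly.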

These requirements already conflict with one another, so this is a hard job. Here is how the Linux kernel scheduler framework is organized:

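The generic Linux scheduler framework consists of a main scheduler and a periodic scheduler. Scheduler classes are instances that implement different scheduling policies, such as CFS and the real-time scheduler, and a scheduler class decides which process runs next. The main scheduler uses the appropriate scheduler class to select a process and then handles the interaction with the underlying CPU. The scheduler-related data structures come next.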
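The split between a generic core and pluggable scheduler classes can be pictured with a small user-space sketch. Everything here is hypothetical and heavily simplified: `toy_rq`, `toy_class`, and the task ids 100/200/0 are invented for illustration and are not the kernel's types or values. The only point is that the core walks the classes in priority order and lets each one decide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a per-CPU run queue. */
struct toy_rq {
    int nr_rt;    /* runnable real-time tasks */
    int nr_fair;  /* runnable normal (CFS-like) tasks */
};

/* Hypothetical stand-in for a scheduler class: a policy behind a pointer. */
struct toy_class {
    const char *name;
    const struct toy_class *next;         /* next lower-priority class */
    int (*pick_next)(struct toy_rq *rq);  /* task id, or -1 if nothing to run */
};

static int pick_rt(struct toy_rq *rq)   { return rq->nr_rt > 0 ? 100 : -1; }
static int pick_fair(struct toy_rq *rq) { return rq->nr_fair > 0 ? 200 : -1; }
static int pick_idle(struct toy_rq *rq) { (void)rq; return 0; } /* idle task always exists */

static const struct toy_class idle_class = { "idle", NULL,        pick_idle };
static const struct toy_class fair_class = { "fair", &idle_class, pick_fair };
static const struct toy_class rt_class   = { "rt",   &fair_class, pick_rt };

/* The "core" only walks the classes; it knows nothing about any policy. */
static int toy_pick_next_task(struct toy_rq *rq)
{
    const struct toy_class *c;

    assert(rq != NULL);
    for (c = &rt_class; c != NULL; c = c->next) {
        int id = c->pick_next(rq);
        if (id >= 0)
            return id;
    }
    return -1; /* not reached: the idle class always returns a task */
}
```

With two normal tasks and no real-time task, `toy_pick_next_task()` returns the fair class's task; as soon as a real-time task becomes runnable, the real-time class wins because it is asked first.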

2. Data structures

Start with the familiar struct task_struct and its scheduler-related members:


int on_rq;
int prio, static_prio, normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;

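on_rq: indicates whether the process is on the run queue.
prio, static_prio, normal_prio, rt_priority: these relate to the process's priority. prio and normal_prio are the dynamic priorities. The kernel sometimes needs to raise a process's priority temporarily, which is why prio exists; for example, to prevent priority inversion, rt_mutex boosts the priority of the process currently holding the lock by setting prio and then triggering a reschedule. normal_prio is what the scheduler uses to compute the process's weight: as CFS shows, the higher the normal priority (that is, the numerically lower the value), the larger the weight, so the process carries a larger share and receives more processor time. rt_priority is the real-time priority and is not used by normal processes.
se: the scheduling entity, the object the scheduler operates on. Embedding it in task_struct is what makes a task schedulable.
sched_class: the interface that a concrete scheduler implements. It mainly contains the enqueue and dequeue operations on the run queue (for CFS the run queue is a red-black tree), selection of the next process to run from the run queue, the low-level part of periodic scheduling, and operations such as changing a process's priority and checking for preemption.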
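To make the priority-to-weight relationship concrete, here is a minimal sketch of the lookup. The table values are the nice-level weights as I recall them from the kernel source of this era, and `weight_of()` is a hypothetical helper, so treat the block as illustrative and check it against your kernel tree. It assumes normal tasks occupy static priorities 100 to 139, with nice 0 at 120.

```c
#include <assert.h>

#define MAX_RT_PRIO 100                          /* priorities 0..99 are real-time */
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)

/* Weight per nice level, from nice -20 (index 0) to nice 19 (index 39). */
static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical helper: weight of a normal task given its static priority. */
static int weight_of(int static_prio)
{
    assert(static_prio >= MAX_RT_PRIO && static_prio < MAX_RT_PRIO + 40);
    return prio_to_weight[static_prio - MAX_RT_PRIO];
}
```

Each step of one nice level changes the weight by roughly a factor of 1.25, which is why a numerically lower priority value translates into a noticeably larger share of processor time.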

Now look at struct sched_entity in detail, which holds the important scheduler-related members:


struct sched_entity {
    struct load_weight load; /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    u64 nr_migrations;
#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif
};

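load: the value representing the process's weight, computed from its priority.
run_node: lets the entity be linked into the run queue; the CFS run queue is a red-black tree.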
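on_rq: indicates whether the entity is on the run queue. Entities on the run queue are runnable and waiting for processor time. When CFS picks an entity to run it takes the entity out of the red-black tree, but, as far as I can tell from the CFS code of this era, on_rq stays set while the entity runs and is cleared to 0 only when the entity is dequeued, for example when it goes to sleep.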
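exec_start: a dynamically updated value. It is set to the current time when the process is scheduled to run and marks when the current stretch of execution began.
sum_exec_runtime: the total time executed on a processor. Because switching must not happen too often, the kernel guarantees each process a minimum run time before it may be preempted; sum_exec_runtime - prev_sum_exec_runtime is the time executed during the current scheduling period.
vruntime: the time the process would have run under perfectly fair, priority-weighted scheduling. It is the most important value in CFS: it is the key of the CFS run queue's red-black tree, and each scheduling decision picks the process with the smallest vruntime, which is the leftmost node of the tree.
cfs_rq: the CFS run queue; every processor has its own run queue.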
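The bookkeeping described by these fields can be sketched in user space. This is a simplified model under stated assumptions, not the kernel's code: `toy_se` and the `toy_*` helpers are hypothetical, vruntime is advanced as delta * NICE_0_LOAD / weight with NICE_0_LOAD taken as 1024, and a linear scan stands in for the red-black tree's leftmost-node lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL  /* assumed weight of a nice-0 task */

/* Hypothetical, trimmed-down scheduling entity. */
struct toy_se {
    uint64_t weight;
    uint64_t exec_start;
    uint64_t sum_exec_runtime;
    uint64_t prev_sum_exec_runtime;
    uint64_t vruntime;
};

/* The entity is put on the CPU at time `now`. */
static void toy_start(struct toy_se *se, uint64_t now)
{
    se->exec_start = now;
    se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

/* Account the time run since exec_start; heavier entities age more slowly. */
static void toy_update_curr(struct toy_se *se, uint64_t now)
{
    uint64_t delta = now - se->exec_start;

    assert(se->weight > 0);
    se->sum_exec_runtime += delta;
    se->vruntime += delta * NICE_0_LOAD / se->weight;
    se->exec_start = now;
}

/* Time executed since the entity was last put on the CPU. */
static uint64_t toy_ran_this_time(const struct toy_se *se)
{
    return se->sum_exec_runtime - se->prev_sum_exec_runtime;
}

/* Stand-in for "leftmost node of the red-black tree": smallest vruntime. */
static size_t toy_pick_leftmost(const struct toy_se *v, size_t n)
{
    size_t best = 0, i;

    assert(n > 0);
    for (i = 1; i < n; i++)
        if (v[i].vruntime < v[best].vruntime)
            best = i;
    return best;
}
```

If an entity of weight 1024 and one of weight 2048 each run for the same wall-clock time, the heavier one accumulates half the vruntime and is therefore picked again sooner, which is how weight turns into a larger share of the CPU.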

3. Scheduling framework

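Start with the periodic scheduler. Its job is easy to guess: it is triggered on every timer tick of each CPU. As shown above, each process has to keep track of its current scheduling time, so this function updates that timing data and then calls the periodic function of the current task's scheduler class. That is the rough outline; there are quite a few details, so here is the code: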


/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;

    sched_clock_tick(); /* hardware clock handling, not relevant here */

    raw_spin_lock(&rq->lock);
    update_rq_clock(rq); /* update the run queue's clock */
    update_cpu_load_active(rq);
    /*
     * Call the scheduler class's low-level tick function. It may set the
     * TIF_NEED_RESCHED flag on the task to request rescheduling; the kernel
     * then reschedules at a suitable point (for example, just before a
     * system call returns to user space). So the periodic scheduler does
     * not perform the actual scheduling; it only raises a request flag.
     */
    curr->sched_class->task_tick(rq, curr, 0);
    raw_spin_unlock(&rq->lock);

    perf_event_task_tick();

#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq, cpu);
#endif
}

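The periodic scheduler is fairly simple and does no real scheduling work. Next is the main scheduler, which responds to the TIF_NEED_RESCHED request raised by the periodic scheduler and performs the actual scheduling.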
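The division of labor between the two can be modeled in a few lines. This is a toy sketch with hypothetical names: `ticks_left` is an invented per-task budget (CFS does not use a fixed tick budget), `need_resched` stands in for TIF_NEED_RESCHED, and `toy_return_to_user()` stands in for one of the safe points where the kernel checks the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    int  ticks_left;    /* hypothetical budget, in timer ticks */
    bool need_resched;  /* stands in for TIF_NEED_RESCHED */
};

/* Periodic side: runs on every tick and only *requests* a reschedule. */
static void toy_scheduler_tick(struct toy_task *curr)
{
    assert(curr != NULL);
    if (curr->ticks_left > 0 && --curr->ticks_left == 0)
        curr->need_resched = true;
}

/*
 * Main side: called at a safe point, such as a return to user space.
 * Returns true if a reschedule was performed.
 */
static bool toy_return_to_user(struct toy_task *curr, int *nr_schedules)
{
    assert(curr != NULL && nr_schedules != NULL);
    if (!curr->need_resched)
        return false;
    curr->need_resched = false;  /* like clear_tsk_need_resched() */
    (*nr_schedules)++;           /* this is where __schedule() would run */
    return true;
}
```

The tick handler never switches tasks itself; the switch happens later, at the first safe point that notices the flag.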

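Before the main scheduler, a word on kernel preemption. Before the 2.5 kernels, code running in kernel mode could not be preempted; the scheduler could switch to another program only once the kernel code had finished, which could cause large system latencies. Kernel preemption was added in 2.5: outside critical regions the kernel can be preempted. Each process maintains a preemption counter, preempt_count. A value of 0 means preemption is allowed; a value greater than 0 means it is not. To disable preemption, call inc_preempt_count, which adds 1 to preempt_count. When a preemptible process is actually preempted, PREEMPT_ACTIVE is added to preempt_count to mark that the process was preempted by the kernel. To keep other inc_preempt_count calls from disturbing this flag, it is a high bit: PREEMPT_ACTIVE = 0x1 << 30.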
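The counter-plus-flag layout can be illustrated with a small user-space model. The names prefixed with `toy_` are hypothetical; only the idea (low bits count nested disables, bit 30 marks an active preemption) comes from the text, and the bit position is the value quoted above, which may differ between architectures and kernel versions.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PREEMPT_ACTIVE (1u << 30)   /* value quoted in the text */

/* Per-task in the kernel; a single global is enough for this model. */
static unsigned int toy_preempt_count;

static void toy_preempt_disable(void)
{
    toy_preempt_count++;
}

static void toy_preempt_enable(void)
{
    /* The nesting depth (everything except the flag bit) must be positive. */
    assert((toy_preempt_count & ~TOY_PREEMPT_ACTIVE) > 0);
    toy_preempt_count--;
}

/* Preemption is allowed only when the whole counter is zero. */
static bool toy_preemptible(void)
{
    return toy_preempt_count == 0;
}

/* True while the task is marked as preempted by the kernel. */
static bool toy_was_preempted(void)
{
    return (toy_preempt_count & TOY_PREEMPT_ACTIVE) != 0;
}
```

Because the flag sits far above the low bits used for nesting, ordinary disable/enable pairs increment and decrement the counter without ever touching it.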

The main scheduler's job is more involved:


/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *      (for example, the semaphore functions call schedule() explicitly)
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *      (that is, on return to user space from an interrupt or a system call)
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task (added to the run-queue) preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 */

static void __sched __schedule(void) /* kernel/sched/core.c */
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    /*
     * Disable kernel preemption: this is a critical region. Preemption
     * exists to trigger a new scheduling pass, so there is no reason to
     * preempt the scheduler only to run the scheduler again.
     */
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);   /* the run queue of the current CPU */
    rcu_note_context_switch(cpu);
    prev = rq->curr;    /* the task now on the CPU, about to be scheduled away */

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock);

    switch_count = &prev->nivcsw; /* by default, count an involuntary context switch */

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        /*
         * Before calling __schedule(), the task may have used
         * set_current_state() to change its state from TASK_RUNNING to a
         * sleeping state such as TASK_INTERRUPTIBLE. If schedule() is
         * called by a TASK_RUNNING task, it simply picks another task to
         * occupy the CPU. If it is called by a TASK_INTERRUPTIBLE or
         * TASK_UNINTERRUPTIBLE task, one extra step is performed: the
         * current task is removed from the run queue before another task
         * is scheduled, which puts it to sleep because it is no longer on
         * the run queue.
         *
         * This check guarantees that a task is not dequeued by a
         * preemption schedule: the branch is taken only when the current
         * task is not runnable (prev->state != 0) and this is not a
         * kernel preemption.
         *
         *   #define TASK_RUNNING         0
         *   #define TASK_INTERRUPTIBLE   1
         *   #define TASK_UNINTERRUPTIBLE 2
         *   #define preempt_count() (current_thread_info()->preempt_count)
         *
         * preempt_count() returns the current task's preemption counter.
         * If PREEMPT_ACTIVE is set, the task has been preempted; so that
         * it can resume quickly, the deactivation below is skipped.
         *
         * We use bit 30 of the preempt_count to indicate that kernel
         * preemption is occurring (i.e. the current task has been
         * preempted). See <asm/hardirq.h>.
         *   #define PREEMPT_ACTIVE 0x40000000
         *
         * If the task has been interrupted by a signal, it must keep
         * running to handle the incoming signal; this applies only to
         * TASK_INTERRUPTIBLE tasks.
         */
        if (unlikely(signal_pending_state(prev->state, prev))) {
            /*
             * A signal is still pending, so make the task runnable again.
             * It must stay on the run queue, otherwise the signal could
             * never be handled.
             */
            prev->state = TASK_RUNNING;
        } else {
            /* No pending signal: deactivate the task and remove it from the run queue. */
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;    /* no longer on the run queue */

            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             * (This concerns kernel threads.)
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }

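    pre_schedule(rq, prev); /* no-op in CFS */

    if (unlikely(!rq->nr_running))  /* no runnable task on this run queue */
        idle_balance(cpu, rq);      /* so try to pull tasks over from other CPUs */

    /* Put the task giving up the CPU back on the run queue and update run-queue statistics. */
    put_prev_task(rq, prev);
    /*
     * Pick the next task to run from run queue rq. The bodies of these two
     * operations live in the concrete scheduler classes, not in the generic
     * scheduler framework, so the actual implementations are covered with
     * CFS and real-time scheduling.
     */
    next = pick_next_task(rq);
    clear_tsk_need_resched(prev);   /* the scheduling decision is made, so clear TIF_NEED_RESCHED */
    rq->skip_clock_update = 0;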

    /* A switch is needed only if the current task and the chosen task differ. */
    if (likely(prev != next)) { /* they are equal only when the scheduler finds no other task that needs to run */
        rq->nr_switches++;
        rq->curr = next;    /* the chosen task replaces the current one */
        ++*switch_count;

        context_switch(rq, prev, next); /* unlocks the rq; performs the low-level context switch */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        cpu = smp_processor_id();   /* a new task was switched in, and it may be running on a different CPU */
        rq = cpu_rq(cpu);           /* for the same reason, refresh the run queue */
    } else
        raw_spin_unlock_irq(&rq->lock);

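    post_schedule(rq);

    preempt_enable_no_resched();    /* re-enable kernel preemption */
    /*
     * If the newly switched-in task has TIF_NEED_RESCHED set, schedule
     * again. This can happen when that task set TIF_NEED_RESCHED and was
     * then preempted by a higher-priority task.
     */
    if (need_resched())
        goto need_resched;
}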
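The decision at the top of __schedule() (leave prev on the run queue, make it runnable again, or dequeue it, and which context-switch counter to bump) can be isolated in a small model. The `toy_*` names are hypothetical, and `signal_pending` is a simplification: it stands for the final result of signal_pending_state(), which in the kernel also depends on the task's state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TASK_RUNNING   0
#define TOY_PREEMPT_ACTIVE 0x40000000u

enum toy_action { TOY_KEEP_ON_RQ, TOY_MAKE_RUNNABLE, TOY_DEQUEUE };

struct toy_prev {
    long state;            /* 0 means runnable, non-zero means going to sleep */
    bool signal_pending;   /* stand-in for signal_pending_state()'s result */
    unsigned long nvcsw;   /* voluntary context switches */
    unsigned long nivcsw;  /* involuntary context switches */
};

/* Mirrors the branch structure of __schedule() for the outgoing task. */
static enum toy_action toy_decide(struct toy_prev *prev,
                                  unsigned int preempt_count, bool switched)
{
    unsigned long *switch_count;
    enum toy_action action = TOY_KEEP_ON_RQ;

    assert(prev != NULL);
    switch_count = &prev->nivcsw;               /* default: involuntary */

    if (prev->state != TOY_TASK_RUNNING &&
        !(preempt_count & TOY_PREEMPT_ACTIVE)) {
        if (prev->signal_pending) {
            prev->state = TOY_TASK_RUNNING;     /* stay runnable to handle the signal */
            action = TOY_MAKE_RUNNABLE;
        } else {
            action = TOY_DEQUEUE;               /* the task really goes to sleep */
        }
        switch_count = &prev->nvcsw;            /* the task asked to block: voluntary */
    }

    if (switched)                               /* counted only when prev != next */
        ++*switch_count;
    return action;
}
```

A sleeping task that is preempted (PREEMPT_ACTIVE set) before it reaches schedule() on its own takes the first path: it stays on the run queue and the switch is counted as involuntary, so it can resume and finish going to sleep later.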
