Process Scheduling

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 09:17

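The scheduler decides which process runs, when it runs, and for how long. The process scheduler (often just called the scheduler) can be viewed as the kernel subsystem that divides the finite resource of processor time among the runnable processes. The scheduler is the basis of a multitasking operating system such as Linux. Only with sensible scheduling decisions are system resources put to best use and do multiple processes appear to execute concurrently.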

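Multitasking

A multitasking operating system is one that can interleave the execution of more than one process, so that they appear to run concurrently.

Multitasking systems come in two flavors: cooperative (non-preemptive) multitasking and preemptive multitasking. Like all Unix variants and most other modern operating systems, Linux implements preemptive multitasking. In this model the scheduler decides when a process stops running so that other processes get a chance to execute. This involuntary suspension is called preemption. The time a process may run before it is preempted is usually fixed in advance and has its own name, the process's timeslice. A timeslice is in effect the slice of processor time given to each runnable process. Managing timeslices well lets the scheduler make decisions from a system-wide perspective and prevents any single process from monopolizing the processor. Many modern operating systems compute timeslices dynamically and make the policy configurable. As we will see, however, Linux's distinctive "fair" scheduler does not itself use timeslices to achieve fairness.
In cooperative multitasking, by contrast, a process keeps running until it voluntarily stops. Voluntarily giving up the processor is called yielding.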

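Process Scheduling in Linux

I/O-Bound and Processor-Bound Processes

Processes can be classified as either I/O-bound or processor-bound.

Scheduling policy usually has to balance two conflicting goals: fast process response (low latency) and maximal system utilization (high throughput). To satisfy both, schedulers often use complex algorithms to pick the most worthwhile process to run, and they often do not guarantee that low-priority processes are treated fairly. The Linux scheduler is tuned for interactive applications and desktop performance, so it optimizes for response time and favors I/O-bound processes. As we will see below, though, it does not neglect processor-bound processes.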

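Process Priority

The most basic class of scheduling algorithm is priority-based scheduling: processes are ranked by their worth and their need for processor time. The usual approach (which Linux does not implement exactly) is that higher-priority processes run before lower-priority ones, and processes with the same priority are scheduled round-robin (one after another, repeating). On some systems, higher-priority processes also receive longer timeslices. The scheduler always selects the highest-priority process whose timeslice has not been exhausted. Both the user and the system can influence scheduling by setting process priorities.

Linux has two separate priority ranges.

The first is the nice value, in the range [-20, 19] with a default of 0. A larger nice value means a lower priority.
The second is the real-time priority, in the range [0, 99]. A higher value means a higher priority. Every real-time process has a higher priority than any normal process. Real-time priorities and nice values occupy disjoint value spaces.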

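Timeslice

The timeslice is a numeric value that says how long a process can run before it is preempted. A timeslice that is too long makes the system respond poorly to interaction, so that it no longer feels as if applications run concurrently. A timeslice that is too short noticeably increases the processor time lost to process switching, because a significant share of system time goes to switching between processes that each run only briefly. The conflict between I/O-bound and processor-bound processes shows up here again: I/O-bound processes do not need long timeslices, whereas processor-bound processes want them as long as possible (to keep their caches hot, for example).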

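The Linux Scheduling Algorithm

The Linux scheduler is modular, so that different types of processes can be scheduled by different algorithms. This modularity is called scheduler classes. Scheduler classes let multiple, dynamically addable algorithms coexist, each scheduling the processes that belong to its own class.
Linux provides two kinds of process priority: normal priority and real-time priority.
Real-time processes:
SCHED_FIFO
SCHED_RR
Normal processes:
SCHED_NORMAL
At any time, real-time processes take priority over normal ones. A real-time process can be preempted only by a higher-priority real-time process. Processes at the same real-time priority are scheduled by FIFO rules (run until blocking or yielding) or RR rules (round-robin with timeslices). While a real-time process is runnable, normal processes get almost no processor time.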

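Scheduling in Unix Systems

Traditional Unix scheduling uses both timeslices and priorities, but the approach has flaws.

The first problem is that nice values must be mapped to absolute timeslices. Two processes of equal priority at nice 0 might each run for 50 ms before switching, whereas two processes at the lowest priority (nice 19) might each run for only 5 ms. In both cases each process gets half of the processor, yet the switching rate differs by a factor of ten, and which rate is preferable depends on whether the processes are I/O-bound or processor-bound. A fixed mapping cannot satisfy both.

The second problem concerns relative nice differences. With nice values 0 and 1 the timeslices might be 100 ms and 95 ms, which are almost equal. With nice values 18 and 19 they might be 10 ms and 5 ms, so one process gets twice the processor time of the other even though the nice values again differ by only one. When these are the only two runnable processes, the difference is dramatic.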

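Fair Scheduling

CFS (the Completely Fair Scheduler) starts from a simple idea: model process scheduling as if the system had an ideal, perfectly multitasking processor. On such a system each process would receive 1/n of the processor's time, where n is the number of runnable processes.

We could also schedule processes for infinitely small durations, so that within any measurable period each of the n processes would have run for the same amount of time. With two processes, for example, each would have run for 5 ms in any 10 ms interval, each receiving half of the processor time.

Such infinitely small scheduling periods are not achievable in practice, because switching between processes itself takes time. Instead of deriving a timeslice from the nice value, CFS calculates how long a process should run as a function of the total number of runnable processes. The nice value serves as a weight on the proportion of processor time a process receives: a higher nice value (lower priority) yields a smaller weight relative to a process with the default nice value, and a lower nice value (higher priority) yields a larger weight.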

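CFS sets a target for its approximation of the infinitely small scheduling period of perfect multitasking. This target is called the targeted latency. A smaller target gives better interactivity and a closer approximation to perfect multitasking, at the cost of more switching overhead and therefore lower overall throughput. Suppose the targeted latency is 20 ms and there are two runnable tasks of equal priority (whatever that priority is). Each task runs for 10 ms before being preempted in favor of the other. With 4 such tasks, each runs for only 5 ms, and with 20 such tasks, each gets just 1 ms.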
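
The equal-priority arithmetic above can be sketched in a few lines of C. This is only an illustration: the helper name equal_slice_us is hypothetical, and the 20 ms target is the value assumed in the example, not a kernel default.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function: split a targeted latency
 * (in microseconds) evenly among nr_running tasks of equal priority.
 */
static uint64_t equal_slice_us(uint64_t target_latency_us,
                               unsigned int nr_running)
{
    assert(nr_running > 0);   /* there must be at least one runnable task */
    return target_latency_us / nr_running;
}
```

For example, equal_slice_us(20000, 4) gives the 5 ms share of the four-task case.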
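Now consider two runnable processes with different nice values, say one with the default nice value (0) and one with a nice value of 5. Different nice values correspond to different weights, so the two processes receive different proportions of the processor. In this example the nice-5 process has about 1/3 the weight of the default process. With a targeted latency of 20 ms, the two processes receive 15 ms and 5 ms of processor time, respectively. What if the two processes instead had nice values of 10 and 15? The allotments would again be 15 ms and 5 ms. Absolute nice values no longer affect scheduling decisions; only relative values affect the proportion of processor time allotted.

To sum up, the processor time that any process receives is determined by the relative difference in nice value between it and all other runnable processes. The effect of nice values on timeslices is no longer additive but geometric. A nice value no longer corresponds to an absolute timeslice, but to a proportion of the processor.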
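
The proportional split can be sketched as follows. The sketch assumes that nice 0 has weight 1024 and that each nice step changes the weight by a factor of 1.25. Both numbers are assumptions made for illustration; the text above only states that nice 5 weighs roughly one third of nice 0, which 1.25^5 ≈ 3.05 matches. The names nice_weight and slice_ms are hypothetical.

```c
#include <assert.h>

/* Assumed model: weight(nice) = 1024 / 1.25^nice (geometric in nice). */
static double nice_weight(int nice)
{
    double w = 1024.0;
    int i;

    assert(nice >= -20 && nice <= 19);
    for (i = 0; i < nice; i++)    /* positive nice: lighter weight */
        w /= 1.25;
    for (i = 0; i > nice; i--)    /* negative nice: heavier weight */
        w *= 1.25;
    return w;
}

/*
 * Share of the targeted latency for a task with nice value `mine`
 * that competes with exactly one other runnable task.
 */
static double slice_ms(double target_latency_ms, int mine, int other)
{
    double wm = nice_weight(mine);
    double wo = nice_weight(other);

    return target_latency_ms * wm / (wm + wo);
}
```

Under these assumptions, slice_ms(20.0, 0, 5) and slice_ms(20.0, 10, 15) both come out at roughly 15 ms, because only the ratio of the two weights enters the result.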

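The Linux Scheduling Implementation

Time Accounting
Process Selection
The Scheduler Entry Point
Sleeping and Waking Up

Time Accounting

CFS no longer has the notion of a timeslice, but it must still keep account of the time each process runs, because it needs to ensure that each process runs only for its fair share of the processor.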

The Scheduler Entity Structure

struct sched_entity {
    struct load_weight load;
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    u64 last_wakeup;
    u64 avg_overlap;
    u64 nr_migrations;
    u64 start_runtime;
    u64 avg_wakeup;
    /* many stat variables elided, enabled only if CONFIG_SCHEDSTATS is set */
};

The scheduler entity structure is embedded in the process descriptor, struct task_struct, as a member variable named se.

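The Virtual Runtime

The vruntime variable stores the virtual runtime of a process, which is its actual runtime (the total time spent running) normalized (or weighted) by the number of runnable processes. Virtual runtime is measured in nanoseconds, so vruntime is decoupled from the timer tick. CFS uses vruntime to account for how long a process has run and thus how much longer it ought to run.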

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock;
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

update_curr() calculates the execution time of the current process and stores that value in delta_exec. It then passes that runtime to __update_curr(), which weights the time by the number of runnable processes. The current process's vruntime is then incremented by the weighted value:

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
 */
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}

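update_curr() is invoked periodically by the system timer, and also whenever a process becomes runnable or blocks and becomes unrunnable. In this way vruntime is an accurate measure of the runtime of a given process and an indicator of which process should run next.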
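
The weighting step can be illustrated with a reduced model. This is a sketch under assumptions, not the kernel's code: struct entity is a hypothetical, much-reduced stand-in for struct sched_entity, NICE_0_WEIGHT is an assumed weight of 1024 for a nice-0 task, and the scaling rule (real time multiplied by the nice-0 weight and divided by the task's own weight) is one plausible reading of what calc_delta_fair() computes.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u          /* assumed weight of a nice-0 task */

/* Hypothetical, much-reduced stand-in for struct sched_entity. */
struct entity {
    uint64_t weight;                 /* load weight derived from nice */
    uint64_t vruntime;               /* weighted runtime, in ns */
    uint64_t sum_exec_runtime;       /* real runtime, in ns */
};

/* Charge delta_exec_ns of real runtime to e and advance its vruntime. */
static void account_runtime(struct entity *e, uint64_t delta_exec_ns)
{
    assert(e->weight > 0);
    e->sum_exec_runtime += delta_exec_ns;
    /* A lighter task (smaller weight) accumulates vruntime faster. */
    e->vruntime += delta_exec_ns * NICE_0_WEIGHT / e->weight;
}
```

With this rule, a task of weight 1024 that runs for 1 ms gains 1 ms of vruntime, while a task of weight 335 gains about three times as much, so the scheduler will prefer it less often.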

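Process Selection

When CFS needs to choose the next process to run, it picks the one with the smallest vruntime. That is the core of the CFS algorithm: select the task with the smallest vruntime.

CFS uses a red-black tree to manage the list of runnable processes and to find the process with the smallest vruntime quickly. In Linux a red-black tree is called an rbtree; it is a self-balancing binary search tree. We discuss self-balancing binary search trees and red-black trees in Chapter 6.

Picking the Next Task

Suppose we already have a red-black tree of runnable processes keyed by vruntime. Because a red-black tree is a binary search tree, the smallest key is at the leftmost node, reached by following left children from the root. The function that selects it is __pick_next_entity():

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}

The function does not actually traverse the tree to find the leftmost node, because that node is cached in the rb_leftmost field.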

Adding Processes to the Tree

This happens when a process wakes up or is first created via fork().

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    account_entity_enqueue(cfs_rq, se);

    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);

    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
}

This function updates the runtime and other statistics and then invokes __enqueue_entity() to perform the actual heavy lifting of inserting the entry into the red-black tree:

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We don't care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

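Consider this function. The while() loop traverses the tree in search of a matching key, which is the inserted process's vruntime. Following the rules of a binary search tree, it moves to the left child if the key is smaller than the current node's key and to the right child otherwise. If it ever moves right, even once, the inserted process cannot be the new leftmost node, so leftmost is set to 0. If it moves only left, leftmost stays 1, which means we have a new leftmost node, and the cache is updated by pointing rb_leftmost at the inserted process. The loop ends when the comparison reaches a node with no child in the chosen direction, that is, when *link is NULL. After the loop, the function calls rb_link_node() on the parent node to make the inserted process its new child, and rb_insert_color() then restores the tree's self-balancing properties.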
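
The leftmost-caching idea can be tried outside the kernel with a plain binary search tree. This is a simplified sketch, not the kernel's rbtree: rebalancing (the job of rb_insert_color()) is deliberately omitted, and struct node, struct timeline, timeline_insert() and timeline_pick_next() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t vruntime;               /* sort key */
    struct node *left, *right;
};

struct timeline {
    struct node *root;
    struct node *leftmost;           /* cached smallest-vruntime node */
};

/* Insert n, updating the cache only if we never turned right. */
static void timeline_insert(struct timeline *t, struct node *n)
{
    struct node **link = &t->root;
    int leftmost = 1;

    n->left = n->right = NULL;
    while (*link) {
        if (n->vruntime < (*link)->vruntime) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            leftmost = 0;            /* cannot be the new minimum */
        }
    }
    if (leftmost)
        t->leftmost = n;
    *link = n;                       /* attach as a leaf; no rebalancing */
}

/* O(1): no traversal, the answer is cached. */
static struct node *timeline_pick_next(const struct timeline *t)
{
    return t->leftmost;
}
```

Caching the minimum trades one extra pointer update per insertion for a constant-time pick, which matters because picking happens on every scheduling decision.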

Removing Processes from the Tree

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    update_stats_dequeue(cfs_rq, se);
    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    account_entity_dequeue(cfs_rq, se);
    update_min_vruntime(cfs_rq);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!sleep)
        se->vruntime -= cfs_rq->min_vruntime;
}

As with adding a process to the red-black tree, the real work is done by a helper function:

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

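The Scheduler Entry Point

The scheduler's job is to choose which process runs next. Its main entry point is schedule(). schedule() is generic with respect to scheduler classes: it finds the highest-priority scheduler class that has a runnable process (each class has its own run queue) and asks that class which process should run next. Given that, it should be no surprise that schedule() is simple. The only important thing it does is call pick_next_task():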

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(rq->nr_running == rq->cfs.nr_running)) {
        p = fair_sched_class.pick_next_task(rq);
        if (likely(p))
            return p;
    }

    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }
}

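The core of the function is the for() loop, which iterates over each scheduler class in priority order, starting with the highest. Each class implements pick_next_task(), which returns a pointer to its next runnable process or NULL if it has none. The first class to return a non-NULL value has selected the next runnable process. CFS's implementation of pick_next_task() calls pick_next_entity(), which in turn calls the __pick_next_entity() function discussed earlier.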
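
The class-by-class walk can be modeled with a linked list of function pointers. This is a minimal user-space sketch: struct task, struct rq_sketch and struct class_sketch are hypothetical stand-ins, and each "run queue" holds at most one task to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

struct task { const char *name; };

/* Hypothetical run queue: at most one runnable task per class. */
struct rq_sketch {
    struct task *rt_task;            /* runnable real-time task or NULL */
    struct task *fair_task;          /* runnable normal task or NULL */
    struct task *idle_task;          /* expected to be non-NULL */
};

static struct task *pick_rt(struct rq_sketch *rq)   { return rq->rt_task; }
static struct task *pick_fair(struct rq_sketch *rq) { return rq->fair_task; }
static struct task *pick_idle(struct rq_sketch *rq) { return rq->idle_task; }

struct class_sketch {
    const struct class_sketch *next; /* next lower-priority class */
    struct task *(*pick_next_task)(struct rq_sketch *rq);
};

/* Classes linked from highest to lowest priority. */
static const struct class_sketch idle_class = { NULL, pick_idle };
static const struct class_sketch fair_class = { &idle_class, pick_fair };
static const struct class_sketch rt_class   = { &fair_class, pick_rt };

/* Ask each class in priority order; the first non-NULL answer wins. */
static struct task *pick_next(struct rq_sketch *rq)
{
    const struct class_sketch *cls;

    for (cls = &rt_class; cls; cls = cls->next) {
        struct task *p = cls->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;                     /* only if idle_task was left unset */
}
```

Because the idle class sits at the end of the list and normally always has a task, the loop terminates with a result in the same way the kernel comment describes.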

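Sleeping and Waking Up

A task sleeps (blocks) because it is waiting for some resource or event, such as file I/O or a hardware event. It is removed from the red-black tree of runnable tasks and placed on a wait queue. Waking up is the reverse: the task is set to the runnable state and moved from the wait queue back into the red-black tree.
Sleeping tasks are in one of two states: interruptible (TASK_INTERRUPTIBLE) or uninterruptible (TASK_UNINTERRUPTIBLE).

Wait Queues

Some simple interfaces for sleeping used to be in wide use. These interfaces, however, have races: It is possible to go to sleep after the condition becomes true. In that case, the task might sleep indefinitely. Therefore, the recommended method for sleeping in the kernel is a bit more complicated: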

/* 'q' is the wait queue we wish to sleep on */
DEFINE_WAIT(wait);

add_wait_queue(q, &wait);
while (!condition) {    /* condition is the event that we are waiting for */
    prepare_to_wait(q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current)) {
        /* handle signal */
    }
    schedule();
}
finish_wait(q, &wait);

Creates a wait queue entry via the macro DEFINE_WAIT().
Adds itself to a wait queue via add_wait_queue(). This wait queue awakens the process when the condition for which it is waiting occurs. Of course, there needs to be code elsewhere that calls wake_up() on the queue when the event actually does occur.
Calls prepare_to_wait() to change the process state to either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. This function also adds the task back to the wait queue if necessary, which is needed on subsequent iterations of the loop.
If the state is set to TASK_INTERRUPTIBLE, a signal wakes the process up. This is called a spurious wake up (a wake-up not caused by the occurrence of the event). So check and handle signals.
When the task awakens, it again checks whether the condition is true. If it is, it exits the loop. Otherwise, it again calls schedule() and repeats.
Now that the condition is true, the task sets itself to TASK_RUNNING and removes itself from the wait queue via finish_wait().

An example: inotify_read().

static ssize_t inotify_read(struct file *file, char __user *buf,
                            size_t count, loff_t *pos)
{
    struct fsnotify_group *group;
    struct fsnotify_event *kevent;
    char __user *start;
    int ret;
    DEFINE_WAIT(wait);

    start = buf;
    group = file->private_data;

    while (1) {
        prepare_to_wait(&group->notification_waitq,
                        &wait,
                        TASK_INTERRUPTIBLE);

        mutex_lock(&group->notification_mutex);
        kevent = get_one_event(group, count);
        mutex_unlock(&group->notification_mutex);

        if (kevent) {
            ret = PTR_ERR(kevent);
            if (IS_ERR(kevent))
                break;
            ret = copy_event_to_user(group, kevent, buf);
            fsnotify_put_event(kevent);
            if (ret < 0)
                break;
            buf += ret;
            count -= ret;
            continue;
        }

        ret = -EAGAIN;
        if (file->f_flags & O_NONBLOCK)
            break;
        ret = -EINTR;
        if (signal_pending(current))
            break;

        if (start != buf)
            break;

        schedule();
    }

    finish_wait(&group->notification_waitq, &wait);
    if (start != buf && ret != -EFAULT)
        ret = buf - start;
    return ret;
}

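Waking Up
Waking is handled by wake_up(), which wakes all tasks waiting on the given wait queue. It calls try_to_wake_up(), which sets the task's state to TASK_RUNNING, calls enqueue_task() to add the task to the red-black tree, and sets need_resched if the awakened task's priority is higher than that of the current task. The code that causes the event to occur is normally responsible for calling wake_up() afterward. For example, when data arrives from the disk, the VFS calls wake_up() on the wait queue that holds the processes waiting for that data.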

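Preemption and Context Switching

When the kernel switches from one process to another, it calls context_switch(), which does two basic jobs:

- Calls switch_mm(), which switches the virtual memory mapping from the previous process's to that of the new process.
- Calls switch_to(), which switches the processor state from the previous process's to the new process's. This involves saving and restoring stack information and the processor registers, along with any other architecture-specific state that must be managed and restored on a per-process basis.
The kernel must know when to call schedule(). If schedule() were called only when code did so explicitly, user-space programs could run indefinitely. Instead, the kernel provides the need_resched flag to signify whether a reschedule should be performed (see Table 4-1). scheduler_tick() sets this flag when a process should be preempted, and try_to_wake_up() sets it when a process with a higher priority than the currently running one becomes runnable. The kernel checks the flag, sees that it is set, and calls schedule() to switch to a new process. The flag is a message to the kernel that another process deserves to run and that the scheduler should be invoked as soon as possible.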
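User Preemption
User preemption occurs when the kernel is about to return to user-space, need_resched is set, and schedule() is therefore invoked. When returning to user-space the kernel knows it is in a safe state: if it is safe to continue executing the current task, it is also safe to pick a new one. So whenever the kernel returns to user-space, whether from an interrupt handler or after a system call, it checks need_resched. User preemption can therefore occur:
- When returning to user-space from a system call.
- When returning to user-space from an interrupt handler.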
Kernel Preemption
In a kernel without kernel preemption, kernel code runs until it completes. That is, the scheduler cannot reschedule a task while it is executing in the kernel; kernel code is scheduled cooperatively, not preemptively. With kernel preemption, a task running in the kernel can be preempted whenever it is safe to do so.
The first change in supporting kernel preemption was the addition of a preemption counter, preempt_count, to each process's thread_info. This counter begins at zero and increments once for each lock that is acquired and decrements once for each lock that is released. When the counter is zero, the kernel is preemptible. Upon return from interrupt, if returning to kernel-space, the kernel checks the values of need_resched and preempt_count. If need_resched is set and preempt_count is zero, then a more important task is runnable, and it is safe to preempt. Thus, the scheduler is invoked. If preempt_count is nonzero, a lock is held, and it is unsafe to reschedule. In that case, the interrupt returns as usual to the currently executing task. When all the locks that the current task is holding are released, preempt_count returns to zero. At that time, the unlock code checks whether need_resched is set. If so, the scheduler is invoked.
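Kernel preemption can occur:
- When kernel code becomes preemptible again.
- If a task in the kernel explicitly calls schedule().
- If a task in the kernel blocks (which also results in a call to schedule()).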
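
The interplay of the preemption counter and need_resched can be sketched as a small state model. This is an illustration only: struct thread_info_sketch and the three helpers are hypothetical names, and real kernel locking does far more than bump a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced model of the per-task preemption state. */
struct thread_info_sketch {
    int preempt_count;               /* number of locks currently held */
    bool need_resched;               /* a reschedule has been requested */
};

/* Acquiring a lock makes the task non-preemptible. */
static void lock_sketch(struct thread_info_sketch *ti)
{
    ti->preempt_count++;
}

/*
 * Releasing a lock; returns true if the unlock path should now
 * invoke the scheduler (count back at zero and reschedule pending).
 */
static bool unlock_sketch(struct thread_info_sketch *ti)
{
    assert(ti->preempt_count > 0);
    ti->preempt_count--;
    return ti->preempt_count == 0 && ti->need_resched;
}

/* Check made when an interrupt returns to kernel-space. */
static bool may_preempt_on_irq_return(const struct thread_info_sketch *ti)
{
    return ti->need_resched && ti->preempt_count == 0;
}
```

In this model a task holding two locks with need_resched set is not preempted on interrupt return; the scheduler is invoked only when the second unlock brings the counter back to zero.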
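Real-Time Scheduling Policies
SCHED_FIFO implements a simple first-in, first-out scheduling algorithm without timeslices. A runnable SCHED_FIFO task is always scheduled ahead of any SCHED_NORMAL task. Once a SCHED_FIFO task becomes runnable, it keeps running until it blocks or explicitly yields the processor; it has no timeslice and can run indefinitely. Only a higher-priority SCHED_FIFO or SCHED_RR task can preempt a SCHED_FIFO task. Two or more SCHED_FIFO tasks at the same priority take turns, but again only when they choose to yield the processor. As long as a SCHED_FIFO task is runnable, all lower-priority tasks must wait until it becomes unrunnable.
SCHED_RR is SCHED_FIFO with timeslices: each task runs until it exhausts a predetermined timeslice, and tasks of equal priority are then rotated round-robin.
Linux's real-time policies provide soft real-time behavior. Soft real-time means that the kernel tries to schedule processes within their timing deadlines but does not guarantee that it will always succeed.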
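Real-time priorities range from 0 to MAX_RT_PRIO - 1. By default MAX_RT_PRIO is 100, so the real-time range is [0, 99]. The nice values of SCHED_NORMAL tasks share this priority space: they occupy [MAX_RT_PRIO, MAX_RT_PRIO + 39], so by default the nice range -20 to +19 maps onto priorities 100 to 139.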
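
The mapping between nice values and this shared priority space is simple arithmetic, sketched below. The constants follow the defaults stated above; the function names nice_to_prio and prio_to_nice are chosen for this illustration.

```c
#include <assert.h>

#define MAX_RT_PRIO 100                  /* default stated in the text */
#define MAX_PRIO    (MAX_RT_PRIO + 40)   /* one past the last normal priority */

/* Map a nice value in [-20, 19] to a priority in [100, 139]. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping for priorities of normal (non-real-time) tasks. */
static int prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio < MAX_PRIO);
    return prio - MAX_RT_PRIO - 20;
}
```

The default nice value 0 therefore lands on priority 120, in the middle of the normal range.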
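Scheduler-Related System Calls
Linux provides a family of system calls for managing scheduler parameters. They can be used to manipulate process priority, scheduling policy, and processor affinity, and they also provide a mechanism for explicitly yielding the processor to other processes.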
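
A few of these calls can be exercised from user space without special privileges. The sketch below uses only the standard POSIX interfaces sched_get_priority_min(), sched_get_priority_max(), sched_yield() and getpriority(); the wrapper names are hypothetical, and the exact priority range reported depends on the system.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

/* Number of distinct priority levels for a policy, or -1 on error. */
static int priority_levels(int policy)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;
    return hi - lo + 1;
}

/*
 * Nice value of the calling process. Since -1 is a legal nice value,
 * callers that care about errors must inspect errno afterward.
 */
static int current_nice(void)
{
    errno = 0;
    return getpriority(PRIO_PROCESS, 0);
}

/* Voluntarily give up the processor; returns 0 on success. */
static int yield_cpu(void)
{
    int rc = sched_yield();

    assert(rc == 0 || rc == -1);
    return rc;
}
```

Calls that change the policy or raise the priority, such as sched_setscheduler() with a real-time policy, typically require elevated privileges, which is why this sketch only reads values and yields.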