How Timer Functions Are Executed

Source: Internet  Editor: 程序博客网  Date: 2024/05/20 20:46
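  • We call sleep() to put a task to sleep whenever a condition cannot yet be met, and we call poll and select all the time when writing network programs. Ever wondered how the kernel implements scheduling that depends on precise timing? Let's take a look...
    1 Preface
    There are two kinds of deferred execution:
    The first kind needs no precise timing. Examples are softirqs and tasklets, which run at the end of each interrupt handler or in the ksoftirqd kernel thread.
    The second kind needs precise timing. Structures that involve a sleeping, waiting process, such as work queues [run by the keventd kernel thread], wait queues [woken by the producer] and completions,
    rely on the timer mechanism whenever a timeout is involved: after a precise interval, the kernel performs the deferred operation.
    Timer types
       Low-resolution timers: the resolution is one tick (1/HZ, for example 10 ms at HZ=100 or 1 ms at HZ=1000), classically driven by the PIT (programmable interval timer, built on the 8253 chip)
       High-resolution timers: can reach nanosecond resolution; a sound card driver, for example, may need to send data to the card at very short intervals
    Because the periodic tick driven by the timer stays active for the kernel's whole lifetime, the system can never remain in a power-saving state for long. This is why dynamic ticks were introduced.
    Dynamic ticks exist in a low-resolution and a high-resolution flavor; all four combinations of (low/high resolution) and (periodic/dynamic tick) are valid in the kernel.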
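    3 Implementation of low-resolution timers
       On IA-32 systems the HPET or the PIT is normally chosen as the periodic source of the timer interrupt, which fires HZ times per second (roughly 100 in the configuration discussed here). A high HZ suits desktop and multimedia systems with frequent interactive use; a low HZ suits servers and batch machines.
    Implementation overview:
    3-1 Clock event device initialization
    We can now walk systematically through the initialization of the kernel's time subsystem. When the system powers up it must register the IRQ0 timer interrupt, initialize the clock source devices, clock event devices and tick devices, and pick a suitable operating mode. Since nothing especially important needs doing this early in boot, the default is low resolution + periodic tick; the mode is switched later depending on the hardware (for example whether a high-resolution timer such as the HPET is available) and on the software configuration (for example whether high-resolution timers were enabled through kernel command-line parameters or the kernel configuration). On a system that supports hrtimer high-resolution mode and has dynamic ticks enabled, hrtimer switches from low to high resolution the first time the timer softirq runs after IRQ0, and then goes on to switch to NOHZ mode. IRQ0 is the system timer interrupt and is handled through the global clock event device (global_clock_event). It is defined as follows: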

     static struct irqaction irq0 = {
             .handler = timer_interrupt,
             .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER,
             .name = "timer"
     };

    A simplified version of its interrupt handler, timer_interrupt, is shown in Listing 12:

     static irqreturn_t timer_interrupt(int irq, void *dev_id)
     {
             . . . .
             global_clock_event->event_handler(global_clock_event);
             . . . .
             return IRQ_HANDLED;
     }

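    Besides updating the time accounting of the process running on the local CPU, profiling and so on, the more important job of global_clock_event->event_handler is to perform global updates such as advancing jiffies. Depending on the environment, the event_handler of this global clock event device may be tick_handle_periodic / tick_handle_periodic_broadcast in low-resolution mode, and is hrtimer_interrupt in high-resolution mode. At present only the HPET or the PIT can serve as global_clock_event. Its initialization is shown in Listing 13: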

     void __init time_init(void)
     {
             late_time_init = x86_late_time_init;
     }

     static __init void x86_late_time_init(void)
     {
             x86_init.timers.timer_init();
             tsc_init();
     }

     /* x86_init.timers.timer_init is a callback pointer to hpet_time_init */
     void __init hpet_time_init(void)
     {
             if (!hpet_enable())
                     setup_pit_timer();
             setup_default_timer_irq();
     }


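    As the code above shows, the system prefers the HPET as global_clock_event; the PIT gets the role only when the HPET is not enabled. While the HPET is being enabled, it is registered both as a clock source device and as a clock event device.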

     hpet_enable
         hpet_clocksource_register
     hpet_legacy_clockevent_register
         clockevents_register_device(&hpet_clockevent);

    clockevents_register_device raises the CLOCK_EVT_NOTIFY_ADD event, which asks for a matching tick_device to be created. The event handler tick_notify then adds the new tick_device.

     clockevents_register_device trigger event CLOCK_EVT_NOTIFY_ADD
     tick_notify receives event CLOCK_EVT_NOTIFY_ADD
         tick_check_new_device
             tick_setup_device

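    While the tick_device is being set up, its event_handler is chosen according to whether the newly added clock event device uses broadcast. The handler used by a tick device in each mode is listed in the following table: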

                      low resolution mode      high resolution mode
     periodic tick    tick_handle_periodic     hrtimer_interrupt
     dynamic tick     tick_nohz_handler        hrtimer_interrupt

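    So in low-resolution mode with a periodic tick, tick_handle_periodic is the handler of the timer interrupt. Every timer interrupt runs it once, and it calls the following functions:
    tick_handle_periodic(struct clock_event_device* dev)  /* kernel/time/tick-common.c */
       -> tick_periodic(cpu)
                   -> do_timer(1)  /* 1 means one jiffy */
                   -> update_process_times(user_mode(get_irq_regs())) /* the saved registers tell whether the tick hit user mode */
                   -> profile_tick(CPU_PROFILING)
    3-2 This event handler performs the following two steps:
    do_timer() -- run by one designated CPU: updates jiffies, the wall-clock time and the system load (average run-queue length)
    update_process_times() -- run by every CPU: updates utime and stime of the current process, raises the timer softirq so that
                                        the timer functions on this CPU run, drives the scheduler tick, and runs the registered POSIX CPU timers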
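    The split between global and per-CPU work can be pictured with a small user-space model. This is only a sketch: `struct tick_model`, `MODEL_NR_CPUS` and `model_tick_periodic` are hypothetical names invented for illustration, and the real tick_periodic() additionally takes xtime_lock and calls profile_tick().

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4   /* hypothetical CPU count for this model */

struct tick_model {
    uint64_t jiffies_64;                       /* global tick counter */
    int do_timer_cpu;                          /* the one CPU that advances jiffies */
    unsigned long local_ticks[MODEL_NR_CPUS];  /* stand-in for per-CPU accounting */
};

/* Models one periodic tick arriving on `cpu`. */
static void model_tick_periodic(struct tick_model *m, int cpu)
{
    if (cpu == m->do_timer_cpu)
        m->jiffies_64 += 1;       /* global work, as in do_timer(1) */
    m->local_ticks[cpu] += 1;     /* local work, as in update_process_times() */
}
```

    Delivering one tick to CPU 0 and one to CPU 1 advances `jiffies_64` only once, while each CPU records its own tick.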
    3-3 do_timer()
    /*
     * The 64-bit jiffies value is not atomic - you MUST NOT read it
     * without sampling the sequence number in xtime_lock.
     * jiffies is defined in the linker script...
     */
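    With dynamic ticks, ticks may be greater than 1, which helps save power
    void do_timer(unsigned long ticks)
    {
       jiffies_64 += ticks;
       update_wall_time();  /* not explored here */
       calc_global_load();  
    /* Definition of system load:
    1 Load average: the average number of processes on the run queue [run-queue length] over a given interval. As a rule of thumb,
    performance is fine while each CPU has no more than 3 active processes; more than 5 tasks per CPU points to a serious performance problem on the machine.
    2 Calculation, repeated roughly every 5 s:
         load(t) = (load(t-1) * EXP_m + n * (2048 - EXP_m)) >> 11
    load(t-1) is the load computed about 5 s earlier
    n is the number of currently active processes (runnable plus TASK_UNINTERRUPTIBLE), expressed in the same fixed-point scale as load
    EXP_m = 2048 * e^(-5/(60*m)) is the decay factor of the m-minute average: EXP_1 = 1884, EXP_5 = 2014
    2048 = 1 << 11 is the fixed-point precision
    */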
    }
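    The fixed-point update described in the comment can be tried out in user space. The sketch below uses the constant names FSHIFT, FIXED_1, EXP_1 and EXP_5 with the values quoted above; the function body is an assumption written to match that formula rather than a copy of any particular kernel version.

```c
#define FSHIFT   11                /* bits of fixed-point precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point, i.e. 2048 */
#define EXP_1    1884UL            /* 2048 * e^(-5/60): 1-minute average */
#define EXP_5    2014UL            /* 2048 * e^(-5/300): 5-minute average */

/*
 * One update step of the exponentially decaying load average.
 * load and active are both fixed-point values (real value * FIXED_1).
 */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}
```

    Starting from an idle system with one runnable task, `calc_load(0, EXP_1, 1 * FIXED_1)` yields 164, which is 164/2048 or about 0.08: the 1-minute average climbs towards 1.0 only gradually.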
    3-4 update_process_times(int user_tick)    updates the per-CPU statistics; this still runs inside the (asynchronous) hardware interrupt handler
    /*
     * Called from the timer interrupt handler to charge one tick to the current
     * process.  user_tick is 1 if the tick is user time, 0 for system.
     */
    void update_process_times(int user_tick)
    {
     struct task_struct *p = current;
     int cpu = smp_processor_id();

     /* Note: this timer irq context must be accounted for as well. */
     account_process_tick(p, user_tick);
     run_local_timers();
     rcu_check_callbacks(cpu, user_tick);
     printk_tick();
     scheduler_tick();
     run_posix_cpu_timers(p);
    }
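    3-4-1 Updating CPU time:
    With dynamic ticks several tick periods may pass before an interrupt actually fires, so the following describes each timer interrupt that does fire.
    It adds 1000000000/HZ to the utime of the current process [10^7 per interrupt at HZ=100], updates the run time of the thread group, and updates
    the per-CPU kernel statistics: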
    /*
     * 'kernel_stat.h' contains the definitions needed for doing
     * some kernel statistics (CPU usage, context switches ...),
     * used by rstatd/perfmeter
     */
    struct cpu_usage_stat {
     cputime64_t user; //cpustat->user = cputime64_add(cpustat->user, tmp);
     cputime64_t nice;
     cputime64_t system;
     cputime64_t softirq;
     cputime64_t irq;
     cputime64_t idle;
     cputime64_t iowait;
     cputime64_t steal;
     cputime64_t guest;
    };
    /*
     * Account a single tick of cpu time.
     * @p: the process that the cpu time gets accounted to
     * @user_tick: indicates if the tick is a user or a system tick
     */
    void account_process_tick(struct task_struct *p, int user_tick)
    {
     cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
     struct rq *rq = this_rq();

     if (user_tick)
      account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
     else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
      account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
            one_jiffy_scaled);
     else
      account_idle_time(cputime_one_jiffy);
    }
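    The three-way decision in account_process_tick() can be restated as a tiny classifier. This is a simplified sketch: `enum tick_kind`, `classify_tick` and its boolean parameters are hypothetical, with `nested_context` standing in for the `irq_count() != HARDIRQ_OFFSET` test (the tick interrupted another interrupt or a softirq).

```c
enum tick_kind { TICK_USER, TICK_SYSTEM, TICK_IDLE };

/*
 * user_tick:      the tick interrupted user-mode code
 * on_idle_task:   the current task is the CPU's idle task
 * nested_context: the tick arrived while other irq/softirq work was running
 */
static enum tick_kind classify_tick(int user_tick, int on_idle_task,
                                    int nested_context)
{
    if (user_tick)
        return TICK_USER;
    if (!on_idle_task || nested_context)
        return TICK_SYSTEM;
    return TICK_IDLE;
}
```

    A tick is charged as idle time only when the idle task was running and nothing else was in progress; interrupt work done on top of the idle task still counts as system time.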
    3-4-2 Raising the softirq that processes the timer functions of this CPU
    /*
     * Called by the local, per-CPU timer interrupt on SMP.
     */
    void run_local_timers(void)
    {
     hrtimer_run_queues();
     raise_softirq(TIMER_SOFTIRQ); 
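    //sets the TIMER_SOFTIRQ bit in this CPU's pending-softirq mask; the softirq then runs on interrupt exit or in the ksoftirqd kernel thread. The softirq function is analyzed in detail below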
     softlockup_tick();
    }
    3-4-3 Process scheduling
    /*
     * This function gets called by the timer code, with HZ frequency.
     * We call it with interrupts disabled.
     *
     * It also gets called by the fork code, when changing the parent's
     * timeslices.
     */
    void scheduler_tick(void)
    {
     int cpu = smp_processor_id();
     struct rq *rq = cpu_rq(cpu);
     struct task_struct *curr = rq->curr;

     sched_clock_tick();

     spin_lock(&rq->lock);
     update_rq_clock(rq);
     update_cpu_load(rq);
     curr->sched_class->task_tick(rq, curr, 0); 
     spin_unlock(&rq->lock);

     perf_event_task_tick(curr, cpu);

    #ifdef CONFIG_SMP
     rq->idle_at_tick = idle_cpu(cpu);
     trigger_load_balance(rq, cpu); //raises SCHED_SOFTIRQ when rebalancing is due, to balance load across the CPUs
    #endif
    }
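    4 Handling timers in the softirq
    1 Comparing times
    Reading the number of jiffies since boot (this is the variant used when BITS_PER_LONG is 64)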
    static inline u64 get_jiffies_64(void)
    {
     return (u64)jiffies;
    }
    2 The standard time comparison macros; all operands are jiffies values
    #define time_after(a,b)  \
     (typecheck(unsigned long, a) && \
      typecheck(unsigned long, b) && \
      ((long)(b) - (long)(a) < 0))
    #define time_before(a,b) time_after(b,a)

    #define time_after_eq(a,b) \
     (typecheck(unsigned long, a) && \
      typecheck(unsigned long, b) && \
      ((long)(a) - (long)(b) >= 0))
    #define time_before_eq(a,b) time_after_eq(b,a)

    /*
     * Calculate whether a is in the range of [b, c].
     */
    #define time_in_range(a,b,c) \
     (time_after_eq(a,b) && \
      time_before_eq(a,c))
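    Why the macros subtract and cast to a signed type instead of comparing directly becomes clear once the counter wraps around. The following user-space sketch uses hypothetical helper names (`jiffies_after`, `jiffies_after_eq`) and subtracts in unsigned arithmetic before casting, which gives the same result as the macros above on the usual two's-complement machines.

```c
#include <limits.h>

/* True if a is later than b, even when the counter wrapped in between. */
static int jiffies_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* True if a is later than or equal to b. */
static int jiffies_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}
```

    With `before = ULONG_MAX - 5` and `after = before + 10` (which wraps to 4), a plain `after > before` is false, while `jiffies_after(after, before)` is true. The comparison is valid as long as the two values are less than half the counter range apart.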
    3 Time conversion
    unsigned int inline jiffies_to_msecs(const unsigned long j)
    {
    #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
     return (MSEC_PER_SEC / HZ) * j;
    //Example: j = 1000 (1000 ticks since boot) and HZ = 100 (100 timer interrupts per second) means 10000 ms = 10 s have elapsed
    #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
     return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
    #else
    # if BITS_PER_LONG == 32
     return (HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32;
    # else
     return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
    # endif
    #endif
    }
    unsigned int inline jiffies_to_usecs(const unsigned long j)
    unsigned long msecs_to_jiffies(const unsigned int m)
    unsigned long usecs_to_jiffies(const unsigned int u)
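    For the common case where HZ divides 1000 evenly, the conversions reduce to a multiplication and a rounded-up division. The sketch below passes HZ as a parameter `hz` so it can run in user space; the `sketch_` names are hypothetical and only the first `#if` branch shown above is modelled.

```c
/* Assumes hz <= 1000 and 1000 % hz == 0 (e.g. 100, 250, 1000). */
static unsigned int sketch_jiffies_to_msecs(unsigned long j, unsigned int hz)
{
    return (unsigned int)((1000u / hz) * j);
}

/* Rounds up, so a timeout never becomes shorter than requested. */
static unsigned long sketch_msecs_to_jiffies(unsigned int m, unsigned int hz)
{
    unsigned int ms_per_tick = 1000u / hz;

    return (m + ms_per_tick - 1) / ms_per_tick;
}
```

    At `hz = 100`, 1000 jiffies convert to 10000 ms, and a 15 ms timeout becomes 2 jiffies because one tick lasts 10 ms.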
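    4 Converting between jiffies and timeval / timespec
    Inside the kernel, time is expressed as a jiffies offset or as an absolute jiffies value, but users defining a timeout think in seconds rather than in HZ, so the kernel provides conversions between the two
    The programmer calls sleep(2) --> converted to a structure
    struct timeval {time_t tv_sec; suseconds_t tv_usec; /* microseconds */}
    struct timespec {time_t tv_sec; long tv_nsec; /* nanoseconds */}
                            --> converted to relative jiffies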
    /* Same for "timeval"
     *
     * Well, almost.  The problem here is that the real system resolution is
     * in nanoseconds and the value being converted is in micro seconds.
     * Also for some machines (those that use HZ = 1024, in-particular),
     * there is a LARGE error in the tick size in microseconds.

     * The solution we use is to do the rounding AFTER we convert the
     * microsecond part.  Thus the USEC_ROUND, the bits to be shifted off.
     * Instruction wise, this should cost only an additional add with carry
     * instruction above the way it was done above.
     */
    unsigned long
    timeval_to_jiffies(const struct timeval *value)
    {
     unsigned long sec = value->tv_sec;
     long usec = value->tv_usec;

     if (sec >= MAX_SEC_IN_JIFFIES){
      sec = MAX_SEC_IN_JIFFIES;
      usec = 0;
     }
     return (((u64)sec * SEC_CONVERSION) +
      (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
       (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
    }
    unsigned long timespec_to_jiffies(const struct timespec *value)
    void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
    void jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
              ---> converted to absolute jiffies
    m = jiffies + n;   /* jiffies is the kernel's current jiffies value, n the relative timeout in jiffies */
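    The whole chain from a seconds/microseconds pair to an absolute expiry can be sketched without the scaled-integer tricks of timeval_to_jiffies(). The code below is a plain-arithmetic approximation under stated assumptions: `struct sketch_timeval`, `sketch_timeval_to_jiffies` and `sketch_expiry` are hypothetical names, HZ is passed as `hz`, and the microsecond part is rounded up, which appears to be the intent of USEC_ROUND.

```c
struct sketch_timeval {
    long tv_sec;
    long tv_usec;   /* microseconds, 0..999999 */
};

/* Relative timeout in jiffies, rounding the sub-second part up. */
static unsigned long sketch_timeval_to_jiffies(const struct sketch_timeval *tv,
                                               unsigned long hz)
{
    unsigned long long usec_ticks =
        ((unsigned long long)tv->tv_usec * hz + 999999ULL) / 1000000ULL;

    return (unsigned long)tv->tv_sec * hz + (unsigned long)usec_ticks;
}

/* Absolute expiry: current jiffies plus the relative timeout. */
static unsigned long sketch_expiry(unsigned long now,
                                   const struct sketch_timeval *tv,
                                   unsigned long hz)
{
    return now + sketch_timeval_to_jiffies(tv, hz);
}
```

    A two-second sleep at `hz = 100` started when jiffies is 1000 therefore expires at jiffies 1200.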
    5 How dynamic timers work
    5-1 Data structures
    The timer structure:
    struct timer_list {
     struct list_head entry;
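     unsigned long expires;                    //absolute jiffies value

     void (*function)(unsigned long);       //callback run on expiry, typically wakes up a process
     unsigned long data;                       //argument passed to the callback, typically a task_struct pointer

     struct tvec_base *base;                  //base of the per-CPU timer vectors this timer belongs to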
    #ifdef CONFIG_TIMER_STATS
     void *start_site;
     char start_comm[16];
     int start_pid;
    #endif
    #ifdef CONFIG_LOCKDEP
     struct lockdep_map lockdep_map;
    #endif
    };
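    The per-CPU timer vector structure
    struct tvec_base {
     spinlock_t lock;                           //lock protecting this timer base
     struct timer_list *running_timer;   //the timer whose callback is currently running
     unsigned long timer_jiffies;         //all timers expiring before this jiffies value have been processed
     unsigned long next_timer;           //jiffies value at which the next timer expires
     struct tvec_root tv1;               //indexed by the low 8 bits: timers due within the next 256 jiffies [2.56 s at HZ=100], spread over 256 lists
     struct tvec tv2;                          //offsets 256 to 2^14-1, spread over 64 lists
    //each of these 64 lists covers a span of 256 jiffies
     struct tvec tv3;                          //offsets 2^14 to 2^20-1, spread over 64 lists
     struct tvec tv4;                          //offsets 2^20 to 2^26-1, spread over 64 lists
     struct tvec tv5;                          //offsets 2^26 to 2^32-1, spread over 64 lists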
    } ____cacheline_aligned;
    struct tvec_base boot_tvec_bases;
    static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
    Each CPU gets a pointer to a timer vector structure, initialized to boot_tvec_bases
    #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
    #define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
    #define TVN_SIZE (1 << TVN_BITS)     /* 64 */
    #define TVR_SIZE (1 << TVR_BITS)     /* 256 */
    #define TVN_MASK (TVN_SIZE - 1)     
    #define TVR_MASK (TVR_SIZE - 1)
    struct tvec {
     struct list_head vec[TVN_SIZE];     //64 lists in total
    };
    struct tvec_root {
     struct list_head vec[TVR_SIZE];     //256 lists in total
    };
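    5-2 How dynamic timers are implemented
    The timer vectors are divided into 5 groups. The first group has 256 entries, each a doubly linked list, and holds the timers due within the next 0-255 ticks. Every other group has 64 entries, again doubly linked lists. In the second group each list spans 2^8 = 256 ticks; in the third group each list spans 2^14 ticks. The kernel mainly works on the first group. Each group has a current position. On every timer tick the kernel looks at the first group, runs all timer functions at the current position and advances the position by 1. When the position reaches 256 it wraps back to 0, the list at the current position of the second group is redistributed into the first group, and the second group's position advances by 1. In the same way the third group refills the second, and so on. Because timers move between groups simply by relinking list pointers, this is very efficient.
       How is the current position of each group represented? It is derived from struct tvec_base->timer_jiffies. This value records a point in time before which all expired timers have already been run. timer_jiffies stays close to jiffies: right after the timer softirq has run it equals jiffies + 1, and it lags behind when the softirq has been delayed. The index of each group is computed as follows:
    #define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
    Note that the first group uses: base->timer_jiffies & TVR_MASK
    Second group, N=0: INDEX(0) = (base->timer_jiffies >> TVR_BITS) & TVN_MASK
    Third group, N=1: INDEX(1) = (base->timer_jiffies >> (TVR_BITS + 1*TVN_BITS)) & TVN_MASK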
    ...
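    The index arithmetic is easy to check outside the kernel. In this sketch the helper `wheel_index` is a hypothetical name; the constants are the non-CONFIG_BASE_SMALL values from the definitions above, and groups are numbered 1 to 5 as in the text.

```c
#define TVN_BITS 6
#define TVR_BITS 8
#define TVN_MASK ((1UL << TVN_BITS) - 1)
#define TVR_MASK ((1UL << TVR_BITS) - 1)

/* Current position of `group` (1..5) for a given timer_jiffies value. */
static unsigned long wheel_index(unsigned long timer_jiffies, int group)
{
    if (group == 1)
        return timer_jiffies & TVR_MASK;
    /* group 2 corresponds to INDEX(0), group 3 to INDEX(1), ... */
    return (timer_jiffies >> (TVR_BITS + (group - 2) * TVN_BITS)) & TVN_MASK;
}
```

    For `timer_jiffies = 254` the positions of the first two groups are 254 and 0; at 512 they are 0 and 2, so the second group has advanced twice while the first group wrapped around twice.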
    5-3 Adding a timer to the per-CPU timer vectors
    1 Defining a timer
    Static definition (at compile time):
    #define DEFINE_TIMER(_name, _function, _expires, _data)  \
     struct timer_list _name =    \
      TIMER_INITIALIZER(_function, _expires, _data)
    #define TIMER_INITIALIZER(_function, _expires, _data) {  \
      .entry = { .prev = TIMER_ENTRY_STATIC }, \
      .function = (_function),   \
      .expires = (_expires),    \
      .data = (_data),    \
      .base = &boot_tvec_bases,   \
      __TIMER_LOCKDEP_MAP_INITIALIZER(  \
       __FILE__ ":" __stringify(__LINE__)) \
     }
    Dynamic initialization (at run time):
    struct timer_list my_timer;
    init_timer(&my_timer);
    my_timer.expires = jiffies + 1*HZ;            /* one second from now */
    my_timer.data = (unsigned long)&my_timer;
    my_timer.function = my_function;   /* void my_function(unsigned long data) */

    2 Activating a timer: the timer is added to the timer vectors of the current CPU
    2-1
    /**
     * add_timer - start a timer
     * @timer: the timer to be added
     *
     * The kernel will do timer ->function(->data) callback from the
     * timer interrupt at the ->expires point in the future. The
     * current time is 'jiffies'.
     *
     * The timer's ->expires, ->function (and if the handler uses it, ->data)
     * fields must be set prior calling this function.
     *
     * Timers with an ->expires field in the past will be executed in the next
     * timer tick.
     */
    void add_timer(struct timer_list *timer)
    {
         BUG_ON(timer_pending(timer));     //make sure timer->entry.next == NULL, i.e. the timer is not yet queued in the kernel
         mod_timer(timer, timer->expires);  //expires is an absolute jiffies value
    }
    2-2 mod_timer  changes the expiry time (expires) of a timer; a timer that is not yet queued is added
    /**
     * mod_timer - modify a timer's timeout
     * @timer: the timer to be modified
     * @expires: new timeout in jiffies
     *
     * mod_timer() is a more efficient way to update the expire field of an
     * active timer (if the timer is inactive it will be activated)
     *
     * mod_timer(timer, expires) is equivalent to:
     *
     *     del_timer(timer); timer->expires = expires; add_timer(timer);
     *
     * Note that if there are multiple unserialized concurrent users of the
     * same timer, then mod_timer() is the only safe way to modify the timeout,
     * since add_timer() cannot modify an already running timer.
     *
     * The function returns whether it has modified a pending timer or not.
     * (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
     * active timer returns 1.)
     */
    int mod_timer(struct timer_list *timer, unsigned long expires) //expires is an absolute value
    {
     /*
      * This is a common optimization triggered by the
      * networking code - if the timer is re-modified
      * to be the same thing then just return:
      */
     if (timer_pending(timer) && timer->expires == expires)
      return 1;
     
    return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
    }
    2-3  __mod_timer
    static inline int
    __mod_timer(struct timer_list *timer, unsigned long expires,
          bool pending_only, int pinned)
    {
     struct tvec_base *base, *new_base;
     unsigned long flags;
     int ret = 0 , cpu;

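     timer_stats_timer_set_start_info(timer); //records which process set up this timer; when the timer function runs later, this information about the queuing process is collected for the timer statistics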
    /*
    void __timer_stats_timer_set_start_info(struct timer_list *timer, void *addr=0)
    {
     if (timer->start_site)
      return;
     
    timer->start_site = addr;
     memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
     timer->start_pid = current->pid;
    }
    */

     BUG_ON(!timer->function);

     base = lock_timer_base(timer, &flags); 
    // locks the struct tvec_base this timer is currently attached to (base->lock)

     if (timer_pending(timer)) {      //the timer is already queued
      detach_timer(timer, 0);
      if (timer->expires == base->next_timer &&
          !tbase_get_deferrable(timer->base))
       base->next_timer = base->timer_jiffies;
      ret = 1;
     } else {
      if (pending_only)
       goto out_unlock;
     }

     debug_activate(timer, expires);

     new_base = __get_cpu_var(tvec_bases);

     cpu = smp_processor_id();

    #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
     if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
      int preferred_cpu = get_nohz_load_balancer();

      if (preferred_cpu >= 0)
       cpu = preferred_cpu;
     }
    #endif
     new_base = per_cpu(tvec_bases, cpu);

     if (base != new_base) {
      /*
       * We are trying to schedule the timer on the local CPU.
       * However we can't change timer's base while it is running,
       * otherwise del_timer_sync() can't detect that the timer's
       * handler yet has not finished. This also guarantees that
       * the timer is serialized wrt itself.
       */
      if (likely(base->running_timer != timer)) {
       /* See the comment in lock_timer_base() */
       timer_set_base(timer, NULL);
       spin_unlock(&base->lock);
       base = new_base;
       spin_lock(&base->lock);
       timer_set_base(timer, base);
      }
     }

     timer->expires = expires;
     if (time_before(timer->expires, base->next_timer) &&
         !tbase_get_deferrable(timer->base))
      base->next_timer = timer->expires;
     internal_add_timer(base, timer);

    out_unlock:
     spin_unlock_irqrestore(&base->lock, flags);

     return ret;
    }
    2-4  internal_add_timer  the core function that queues a timer
    static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
    {
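     unsigned long expires = timer->expires; //normally later than jiffies; base->timer_jiffies normally stays close to jiffies
     unsigned long idx = expires - base->timer_jiffies;  //offset relative to the current timer_jiffies
     struct list_head *vec;

     if (idx < TVR_SIZE) {                   //less than 256 ticks after timer_jiffies
      int i = expires & TVR_MASK;   
    /*
    * Picks the list in the first group. Example: timer_jiffies = 254, timer->expires = 259, idx = 5, i.e. the timer fires 5 ticks from now.
    * The current position of the first group is timer_jiffies & 255 = 254, so 5 slots further on, wrapping around, is slot 3; and indeed i = 259 & 255 = 3.
    */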
      vec = base->tv1.vec + i;
     }
    else if (idx < 1<< (TVR_BITS + TVN_BITS)) {       //between 256 and 2^14-1 ticks after timer_jiffies
         int i = (expires >> TVR_BITS) & TVN_MASK; 
          vec = base->tv2.vec + i;
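    /*
    * Key idea: the slot is computed from the absolute timer->expires, while timer_jiffies defines the current
    * position in every group. Since timer->expires normally lies ahead of timer_jiffies, the chosen slot always
    * lies ahead of the current position.
    * Again assume timer_jiffies = 254 and timer->expires = 512, so idx = 258: the timer fires 258 ticks from now.
    * i = (512 >> 8) & 63 = 2, so the timer goes into slot 2 of the second group.
    * The current position of the first group is 254 and that of the second group is INDEX(0) = 0.
    * Two ticks later timer_jiffies = 256: the first group has wrapped to 0, so slot INDEX(0) = 1 of the second
    * group is moved down into the first group.
    * After 258 ticks timer_jiffies = 512: the first group wraps again and slot 2 of the second group is moved down.
    * The timer is re-added with idx = 0 and lands in slot 512 % 256 = 0 of the first group, which is the slot
    * being processed at that moment, so the expired timer runs.
    */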
     } else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {   
      int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
      vec = base->tv3.vec + i;
     } else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
      int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
      vec = base->tv4.vec + i;
     } else if ((signed long) idx < 0) {
      /*
       * Can happen if you add a timer with expires == jiffies,
       * or you set a timer to go off in the past
       */
      vec = base->tv1.vec + (base->timer_jiffies & TVR_MASK);
     }
    else
    {
      int i;
      /* If the timeout is larger than 0xffffffff on 64-bit
       * architectures then we use the maximum timeout:
       */
      if (idx > 0xffffffffUL) {
       idx = 0xffffffffUL;
       expires = idx + base->timer_jiffies;
      }
      i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
      vec = base->tv5.vec + i;
     }
     /*
      * Timers are FIFO:
      */
     list_add_tail(&timer->entry, vec);
    }
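    The slot selection can be isolated from the list handling and exercised in user space. The following sketch mirrors the branch order of internal_add_timer(); `struct wheel_slot` and `pick_slot` are hypothetical names, and the constants assume CONFIG_BASE_SMALL is off.

```c
#define TVN_BITS 6
#define TVR_BITS 8
#define TVR_SIZE (1UL << TVR_BITS)
#define TVN_MASK ((1UL << TVN_BITS) - 1)
#define TVR_MASK (TVR_SIZE - 1)

struct wheel_slot {
    int group;           /* 1..5, corresponding to tv1..tv5 */
    unsigned long slot;  /* list index inside that group */
};

static struct wheel_slot pick_slot(unsigned long timer_jiffies,
                                   unsigned long expires)
{
    unsigned long idx = expires - timer_jiffies;
    struct wheel_slot s;

    if (idx < TVR_SIZE) {
        s.group = 1;
        s.slot = expires & TVR_MASK;
    } else if (idx < 1UL << (TVR_BITS + TVN_BITS)) {
        s.group = 2;
        s.slot = (expires >> TVR_BITS) & TVN_MASK;
    } else if (idx < 1UL << (TVR_BITS + 2 * TVN_BITS)) {
        s.group = 3;
        s.slot = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
    } else if (idx < 1UL << (TVR_BITS + 3 * TVN_BITS)) {
        s.group = 4;
        s.slot = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
    } else if ((long)idx < 0) {
        /* already expired: use the slot that is processed next */
        s.group = 1;
        s.slot = timer_jiffies & TVR_MASK;
    } else {
        if (idx > 0xffffffffUL)   /* clamp very long timeouts */
            expires = 0xffffffffUL + timer_jiffies;
        s.group = 5;
        s.slot = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
    }
    return s;
}
```

    With `timer_jiffies = 254`, an expiry of 259 maps to group 1, slot 3, and an expiry of 512 maps to group 2, slot 2, matching the two worked examples in the comments.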
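    3 The timer softirq: processing the timers of this CPU
    As shown earlier, once update_process_times has accounted the time of the running process, it calls run_local_timers
    3-1 Raising the timer softirq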
    /*
     * Called by the local, per-CPU timer interrupt on SMP.
     */
    void run_local_timers(void)
    {
     hrtimer_run_queues();
     raise_softirq(TIMER_SOFTIRQ);
     softlockup_tick();
    }
    void raise_softirq(unsigned int nr)
    {
     unsigned long flags;

     local_irq_save(flags);
     raise_softirq_irqoff(nr);
     local_irq_restore(flags);
    }
    inline void raise_softirq_irqoff(unsigned int nr)
    {
     __raise_softirq_irqoff(nr);   //sets bit nr in this CPU's softirq_pending mask (32 bits, local_cpu_data->softirq_pending) to mark that this softirq has work to do
    /*
      * If we're in an interrupt or softirq, we're done
      * (this also catches softirq-disabled code). We will
      * actually run the softirq once we return from
      * the irq or softirq.
      *
      * Otherwise we wake up ksoftirqd to make sure we
      * schedule the softirq soon.
      */
     if (!in_interrupt())
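      wakeup_softirqd(); //not in interrupt context: wake the ksoftirqd kernel thread to run the softirq; otherwise it runs at the end of interrupt handling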
    }
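    Raising a softirq only sets a bit; the handler runs later, when the pending mask is scanned. The user-space model below illustrates that two-phase pattern. All names (`struct softirq_model`, `model_raise_softirq`, `model_do_softirq`, `sample_timer_action`, `MODEL_TIMER_SOFTIRQ`) are hypothetical, and the real __do_softirq() also handles re-raised softirqs, interrupt masking and the ksoftirqd fallback.

```c
#define MODEL_NR_SOFTIRQS   32
#define MODEL_TIMER_SOFTIRQ 1    /* illustrative softirq number */

typedef void (*softirq_fn)(void);

struct softirq_model {
    unsigned int pending;                  /* one bit per softirq */
    softirq_fn action[MODEL_NR_SOFTIRQS];  /* registered handlers */
};

static int timer_softirq_runs;             /* counts handler invocations */

static void sample_timer_action(void)
{
    timer_softirq_runs++;
}

/* Phase 1: mark the softirq as pending; nothing runs yet. */
static void model_raise_softirq(struct softirq_model *m, unsigned int nr)
{
    m->pending |= 1u << nr;
}

/* Phase 2: run every pending handler once; returns how many ran. */
static int model_do_softirq(struct softirq_model *m)
{
    unsigned int pending = m->pending;
    unsigned int nr;
    int ran = 0;

    m->pending = 0;
    for (nr = 0; pending; nr++, pending >>= 1) {
        if ((pending & 1u) && m->action[nr]) {
            m->action[nr]();
            ran++;
        }
    }
    return ran;
}
```

    After `model_raise_softirq(&m, MODEL_TIMER_SOFTIRQ)` the handler has not run yet; it runs exactly once on the next `model_do_softirq(&m)`, however many times the bit was raised in between.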
    3-2 Running the timer softirq. The softirq function that belongs to the timer interrupt is:
     /*
     * This function runs timers and the timer-tq in bottom half context.
     */
    static void run_timer_softirq(struct softirq_action *h)
    {
     struct tvec_base *base = __get_cpu_var(tvec_bases);

     perf_event_do_pending();

     hrtimer_run_pending();

     if (time_after_eq(jiffies, base->timer_jiffies))
              __run_timers(base);
    }
    3-3 调用定时器处理函数
    /**
     * __run_timers - run all expired timers (if any) on this CPU.
     * @base: the timer vector to be processed.
     *
     * This function cascades all vectors and executes all expired timer
     * vectors.
     */
    static inline void __run_timers (struct tvec_base *base)
    {
     struct timer_list *timer;

     spin_lock_irq(&base->lock);
     while (time_after_eq(jiffies, base->timer_jiffies)) {   
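    //loops while jiffies is not behind timer_jiffies; each iteration advances timer_jiffies by 1 until it has passed jiffies
      struct list_head work_list;
      struct list_head *head = &work_list;
      int index = base->timer_jiffies & TVR_MASK;
        //slot of the first group that corresponds to the timer_jiffies value being processed
      /*
       * Cascade timers: the first group does the work. Whenever its index wraps to 0, i.e. timer_jiffies % 256 == 0,
       *                 one slot of the next group is redistributed downwards, and so on. INDEX(n) is the current slot of group n+2
       */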
      if (!index &&
       (!cascade(base, &base->tv2, INDEX(0))) &&       
        (!cascade(base, &base->tv3, INDEX(1))) &&
         !cascade(base, &base->tv4, INDEX(2)))
       cascade(base, &base->tv5, INDEX(3));
      ++base->timer_jiffies;  //advance by one tick per iteration
      list_replace_init(base->tv1.vec + index, &work_list);
    //moves the timers of this slot (which may include timers cascaded down from higher groups) to work_list and leaves the slot empty
      while (!list_empty(head)) {
       void (*fn)(unsigned long);
       unsigned long data;

       timer = list_first_entry(head, struct timer_list,entry);   //handle each timer in turn
       fn = timer->function;
       data = timer->data;

       timer_stats_account_timer(timer);  //statistics: fills a struct entry that tracks which process set up this timer

       set_running_timer(base, timer);
       detach_timer(timer, 1);        //unlinks the timer from its doubly linked list

       spin_unlock_irq(&base->lock);
       {
        int preempt_count = preempt_count();

    #ifdef CONFIG_LOCKDEP
        /*
         * It is permissible to free the timer from
         * inside the function that is called from
         * it, this we need to take into account for
         * lockdep too. To avoid bogus "held lock
         * freed" warnings as well as problems when
         * looking into timer->lockdep_map, make a
         * copy and use that here.
         */
        struct lockdep_map lockdep_map =
         timer->lockdep_map;
    #endif
        /*
         * Couple the lock chain with the lock chain at
         * del_timer_sync() by acquiring the lock_map
         * around the fn() call here and in
         * del_timer_sync().
         */
        lock_map_acquire(&lockdep_map);

        trace_timer_expire_entry(timer);
        fn(data);  //run the timer callback
        trace_timer_expire_exit(timer);

        lock_map_release(&lockdep_map);

        if (preempt_count != preempt_count()) {
         printk(KERN_ERR "huh, entered %p "
                "with preempt_count %08x, exited"
                " with %08x?\n",
                fn, preempt_count,
                preempt_count());
         BUG();
        }
       }
       spin_lock_irq(&base->lock);
      }
     }
     set_running_timer(base, NULL);
     spin_unlock_irq(&base->lock);
    }
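    /*
    * Purpose: re-inserts every timer of slot index of this group into the timer vectors; because timer_jiffies
    *          has advanced in the meantime, the timers land in new positions
    * base   the per-CPU timer base
    * tv     the group to cascade from
    * index  current slot of timer_jiffies in this group; may be 0
    * Returns index; a return value of 0 means this group has wrapped around, so the caller cascades the next higher group as well
    */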
    static int cascade(struct tvec_base *base, struct tvec *tv, int index)
    {
     /* cascade all the timers from tv up one level */
     struct timer_list *timer, *tmp;
     struct list_head tv_list;

     list_replace_init(tv->vec + index, &tv_list);

     /*
      * We are removing _all_ timers from the list, so we
      * don't have to detach them individually.
      */
     list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
      BUG_ON(tbase_get_base(timer->base) != base);
      internal_add_timer(base, timer);
     }
     return index;
    }
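    To see cascading and expiry work together, here is a deliberately reduced two-level wheel that runs in user space. It is a sketch under several assumptions: all `mini_` names are hypothetical, only tv1 and tv2 are modelled (so timeouts must stay below 2^14 ticks), slots are singly linked LIFO lists instead of the kernel's FIFO list_head lists, and instead of calling a callback each timer records the tick at which it fired.

```c
#include <stddef.h>

#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_MASK (TVR_SIZE - 1)
#define TVN_MASK (TVN_SIZE - 1)

struct mini_timer {
    struct mini_timer *next;
    unsigned long expires;   /* absolute tick */
    long fired_at;           /* set when the timer runs; caller inits to -1 */
};

struct mini_wheel {
    unsigned long timer_jiffies;       /* next tick to be processed */
    struct mini_timer *tv1[TVR_SIZE];  /* next 256 ticks, one slot per tick */
    struct mini_timer *tv2[TVN_SIZE];  /* 256 ticks per slot */
};

static void mini_add(struct mini_wheel *w, struct mini_timer *t)
{
    unsigned long idx = t->expires - w->timer_jiffies;
    struct mini_timer **slot;

    if ((long)idx < 0)                 /* already expired: run on next tick */
        slot = &w->tv1[w->timer_jiffies & TVR_MASK];
    else if (idx < TVR_SIZE)
        slot = &w->tv1[t->expires & TVR_MASK];
    else                               /* sketch assumes idx < 2^14 */
        slot = &w->tv2[(t->expires >> TVR_BITS) & TVN_MASK];
    t->next = *slot;
    *slot = t;
}

/* Processes every tick up to and including `now`. */
static void mini_run(struct mini_wheel *w, unsigned long now)
{
    while ((long)(now - w->timer_jiffies) >= 0) {
        unsigned long index = w->timer_jiffies & TVR_MASK;
        struct mini_timer *t;

        if (index == 0) {              /* tv1 wrapped: cascade one tv2 slot */
            unsigned long i2 = (w->timer_jiffies >> TVR_BITS) & TVN_MASK;

            t = w->tv2[i2];
            w->tv2[i2] = NULL;
            while (t) {
                struct mini_timer *next = t->next;
                mini_add(w, t);        /* lands in tv1 this time */
                t = next;
            }
        }
        t = w->tv1[index];
        w->tv1[index] = NULL;
        while (t) {                    /* "run" the expired timers */
            struct mini_timer *next = t->next;
            t->fired_at = (long)w->timer_jiffies;
            t = next;
        }
        ++w->timer_jiffies;
    }
}
```

    Starting at `timer_jiffies = 254`, a timer with `expires = 259` sits in tv1 and fires at tick 259, while a timer with `expires = 512` waits in tv2 slot 2, is cascaded into tv1 slot 0 when tick 512 is processed, and fires in that same iteration.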
