A Brief Analysis of the Ins and Outs of the Linux Kernel Time Subsystem

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 09:52

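1.1 Time mechanisms provided by Linux, as seen from the AP (application)

1.1.1 How time is represented

In Linux, from the AP's point of view, time is represented in two ways:

1. Wall-clock time: the time elapsed since 00:00:00 UTC on January 1, 1970. That reference point is called the Linux epoch. This representation is convenient for software. A user application can read and set it with time()/stime(), which are accurate to the second, or with gettimeofday()/settimeofday(), whose struct timeval carries microseconds (clock_gettime() with struct timespec carries nanoseconds). The resolution actually delivered depends on the clock source; on ARM systems it is typically 10 ms.

2. Broken-down calendar time: year, month, day, hour, minute, second. This is what the user ultimately wants to see, and the RTC keeps time in this form as well. A user application can convert between the two representations with gmtime()/mktime(), ctime()/asctime() and localtime(). (Note: gmtime(), localtime(), ctime() and asctime() are not thread safe, because they return pointers to static storage that the next call overwrites. Use the thread-safe xxx_r variants instead; see the man pages for details.)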
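
The conversion from the first representation to the second can be sketched as follows. This is a minimal user-space example, assuming a POSIX system; format_utc is an illustrative name that does not come from the original text.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdio.h>
#include <time.h>

/*
 * Convert wall-clock seconds since the epoch into broken-down UTC time
 * and format it as "YYYY-MM-DD HH:MM:SS".
 * Uses the thread-safe gmtime_r() instead of gmtime().
 * Returns 0 on success, -1 on failure.
 */
int format_utc(time_t secs, char *buf, size_t len)
{
    struct tm tm;

    if (gmtime_r(&secs, &tm) == NULL)
        return -1;
    if (strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm) == 0)
        return -1;
    return 0;
}
```

Passing the value returned by time(NULL) formats the current time; passing 0 formats the epoch itself.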

1.1.2 AP Interval Timer

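An AP often also needs to create timers, for example with getitimer()/setitimer(). These are usually called interval timers (itimers for short): the timer counts down an interval value. Once the timer is started the interval keeps decreasing; when it reaches 0 the interval timer is said to expire, and the kernel then normally sends the corresponding signal to the process. If the reload value (it_interval) is non-zero, the itimer is re-armed on every expiry, so unless it is deleted it keeps expiring periodically and sending the signal.

An itimer can use one of several time bases:

◆ ITIMER_REAL: once started, an itimer based on ITIMER_REAL decrements its interval counter on every clock tick, whether or not the process is running and whether it runs in kernel mode or user mode. When the counter reaches 0 the kernel sends SIGALRM to the process.

◆ ITIMER_VIRTUAL: once started, an itimer based on ITIMER_VIRTUAL decrements its interval counter on a clock tick only while the timer's owner process is running in user mode. When the counter reaches 0 the kernel sends SIGVTALRM to the process.

◆ ITIMER_PROF: once started, an itimer based on ITIMER_PROF decrements its interval counter on a clock tick only while the timer's owner process is running, either in user mode or in kernel mode after entering through a system call. When the counter reaches 0 the kernel sends SIGPROF to the process.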
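
A periodic ITIMER_REAL timer can be sketched in user space as follows. This is only an illustration, assuming a single-threaded POSIX process; count_alarms and on_alarm are hypothetical names.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarm_count;

static void on_alarm(int signo)
{
    (void)signo;
    alarm_count++;              /* one increment per delivered SIGALRM */
}

/*
 * Arm a periodic ITIMER_REAL with the given period and wait until
 * `expirations` SIGALRM signals have been handled.
 * Returns the number of expirations waited for, or -1 on error.
 */
int count_alarms(long period_ms, int expirations)
{
    struct sigaction sa, old_sa;
    struct itimerval it, stop;
    sigset_t block, old_mask, wait_mask;

    if (period_ms <= 0 || expirations <= 0)
        return -1;

    alarm_count = 0;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) < 0)
        return -1;

    /* Block SIGALRM so it can only be delivered inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);

    /* it_value is the first expiry, it_interval the reload value. */
    it.it_value.tv_sec = period_ms / 1000;
    it.it_value.tv_usec = (period_ms % 1000) * 1000;
    it.it_interval = it.it_value;
    if (setitimer(ITIMER_REAL, &it, NULL) == 0) {
        while (alarm_count < expirations)
            sigsuspend(&wait_mask);
    }

    /* Disarm the timer, then restore the mask and the old handler. */
    memset(&stop, 0, sizeof stop);
    setitimer(ITIMER_REAL, &stop, NULL);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    return (alarm_count >= expirations) ? expirations : -1;
}
```

Setting it_interval to zero instead would make the timer fire only once, which is the one-shot case.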

1.2 A top-down walk from gettimeofday to see where Linux time comes from

1.2.1 The full path of gettimeofday

gettimeofday (C library function) =>> sys_gettimeofday =>> do_gettimeofday =>> __get_realtime_clock_ts:

static inline void __get_realtime_clock_ts(struct timespec *ts)

{

unsigned long seq;

s64 nsecs;

do {

seq = read_seqbegin(&xtime_lock);

*ts = xtime;

nsecs = __get_nsec_offset();

} while (read_seqretry(&xtime_lock, seq));

timespec_add_ns(ts, nsecs);

}

xtime is defined as follows (kernel/time/timekeeping.c):

struct timespec xtime __attribute__ ((aligned (16)));

This structure is defined as follows:

struct timespec {

time_t tv_sec; /* seconds since 1970-01-01 00:00:00 UTC */

long tv_nsec; /* nanoseconds */

};
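
The retry loop in __get_realtime_clock_ts is the reader side of a sequence lock, followed by a normalizing add. The user-space model below is a sketch of that pattern, not the kernel code: it assumes a single writer, and all names ending in _sketch or starting with xtime_ are illustrative.

```c
#include <stdatomic.h>

#define NSEC_PER_SEC 1000000000L

struct ts_sketch {
    long sec;
    long nsec;
};

static atomic_uint seq;              /* even: stable, odd: write in progress */
static struct ts_sketch xtime_model; /* stands in for the kernel's xtime */

/* Writer: make the counter odd, update the data, make it even again. */
void xtime_write(long sec, long nsec)
{
    atomic_fetch_add(&seq, 1);
    xtime_model.sec = sec;
    xtime_model.nsec = nsec;
    atomic_fetch_add(&seq, 1);
}

/*
 * Reader: take a snapshot and retry if a write was in progress or
 * completed while the snapshot was being taken. Then add the offset
 * since the last tick and carry whole seconds out of nsec, which is
 * what timespec_add_ns() does in the kernel.
 */
struct ts_sketch xtime_read(long offset_ns)
{
    struct ts_sketch ts;
    unsigned int start;

    do {
        start = atomic_load(&seq);
        ts = xtime_model;
    } while ((start & 1u) || atomic_load(&seq) != start);

    ts.nsec += offset_ns;
    while (ts.nsec >= NSEC_PER_SEC) {
        ts.nsec -= NSEC_PER_SEC;
        ts.sec++;
    }
    return ts;
}
```

The reader never blocks the writer, which is why this pattern suits a value that is updated from an interrupt handler and read very often.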

1.2.2 When xtime is initialized

On a cold boot, xtime is set from the RTC time.

start_kernel() =>> rest_init() =>> kernel_init() =>> do_basic_setup() =>> do_initcalls() calls rtc_hctosys, which does the following:

1. rtc_read_time gets the current RTC time.

2. rtc_tm_to_time converts the current RTC time to the number of seconds since 00:00:00 UTC on January 1, 1970.

3. do_settimeofday =>> set_normalized_timespec(&xtime, sec, nsec); sets xtime.
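
Step 2 is plain calendar arithmetic. The sketch below uses the Gauss-style formula that the kernel's mktime() is also based on; rtc_fields_to_epoch is an illustrative name, and the inputs are assumed to be a valid Gregorian UTC date with a 1-based month.

```c
#include <stdint.h>

/*
 * Convert a broken-down UTC date and time to seconds since
 * 1970-01-01 00:00:00 UTC.
 * year is the full year (e.g. 2000), mon is 1..12, day is 1..31.
 */
int64_t rtc_fields_to_epoch(int year, int mon, int day,
                            int hour, int min, int sec)
{
    int64_t days;

    /* Move January and February to the end of the previous year,
     * so the leap day becomes the last day of the shifted year. */
    mon -= 2;
    if (mon <= 0) {
        mon += 12;
        year -= 1;
    }

    /* Leap days so far, days in the elapsed months, day of month. */
    days = year / 4 - year / 100 + year / 400 + 367 * mon / 12 + day;
    /* Whole years, minus the day count of the epoch itself. */
    days += (int64_t)year * 365 - 719499;

    return ((days * 24 + hour) * 60 + min) * 60 + sec;
}
```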

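1.2.3 How xtime is kept accurately up to date as time passes

Linux Kernel 2.6.23 implements several time subsystem architectures, which can be selected at configuration time. No1 and U3 currently use the same configuration: low-resolution timers with periodic ticks. We use only this configuration to explain the ins and outs; once it is understood, the other options are easy to follow.

This structure is actually simple: a timer hardware block raises an interrupt every 10 ms (this interrupt is called a tick, or more vividly the Linux heartbeat), and the IRQ handler keeps updating xtime.

1.2.3.1 Clock source devices and clock event devices

In Linux Kernel 2.6.23 the source of the clock is modeled as a device called a clock source:

struct clocksource; see include/linux/clocksource.h.

Each clock event that is generated (also called a tick), together with its handling, is described by: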

/**

* struct clock_event_device - clock event device descriptor

* @name: ptr to clock event name

* @features: features

* @max_delta_ns: maximum delta value in ns

* @min_delta_ns: minimum delta value in ns

* @mult: nanosecond to cycles multiplier

* @shift: nanoseconds to cycles divisor (power of two)

* @rating: variable to rate clock event devices

* @irq: IRQ number (only for non CPU local devices)

* @cpumask: cpumask to indicate for which CPUs this device works

* @set_next_event: set next event function

* @set_mode: set mode function

* @event_handler: Assigned by the framework to be called by the low

* level handler of the event source

* @broadcast: function to broadcast events

* @list: list head for the management code

* @mode: operating mode assigned by the management code

* @next_event: local storage for the next event in oneshot mode

*/

struct clock_event_device {

const char *name;

unsigned int features;

unsigned long max_delta_ns;

unsigned long min_delta_ns;

unsigned long mult;

int shift;

int rating;

int irq;

cpumask_t cpumask;

int (*set_next_event)(unsigned long evt,

struct clock_event_device *);

void (*set_mode)(enum clock_event_mode mode,

struct clock_event_device *);

void (*event_handler)(struct clock_event_device *);

void (*broadcast)(cpumask_t mask);

struct list_head list;

enum clock_event_mode mode;

ktime_t next_event;

};
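
The mult and shift fields exist so that nanoseconds can be converted to timer cycles with a multiply and a shift instead of a division. The sketch below shows that arithmetic in user space; the function names and the 1 MHz example frequency are assumptions for illustration, not values from the SoC discussed here.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Compute mult so that cycles = (ns * mult) >> shift for a timer
 * that counts at freq_hz. This is freq_hz / NSEC_PER_SEC scaled
 * up by 2^shift to keep precision in integer arithmetic.
 */
uint64_t clockevent_mult(uint32_t freq_hz, unsigned int shift)
{
    return ((uint64_t)freq_hz << shift) / NSEC_PER_SEC;
}

/* Convert a delay in nanoseconds to timer cycles. */
uint64_t ns_to_cycles(uint64_t ns, uint64_t mult, unsigned int shift)
{
    return (ns * mult) >> shift;
}
```

With shift = 32 and a 1 MHz timer, a 10 ms delay converts to roughly 10000 cycles; the result can be one cycle short because mult is rounded down.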

1.2.3.2 How one clock event is handled, from start to finish

Let us look in detail at how one SoC running ARM Linux handles the interrupt (tick) that arrives every 10 ms:

static irqreturn_t

xxx_gpta_interrupt(int irq, void *dev_id)

{

struct clock_event_device *c = dev_id;

if (c->mode == CLOCK_EVT_MODE_ONESHOT) {

/*Disable the timer A, but don't affect the other bits in TimerA control register*/

/*But we need not disable the TimerA because when timer interrupt come out

* the timer will be disabled by XXX.

* I got this conclusion from testing in bootloader code */

//GPTCTLA &= ~GPT_CR_EN;

/*signal the event*/

/*we don't do it as in CLOCK_EVT_MODE_PERIODIC (by "do while"

* because the similar mechanism is implemented in tick_handle_periodic ) */

c->event_handler(c);

} else if (c->mode == CLOCK_EVT_MODE_PERIODIC) {

do {

/*Clear the TimerA status, write one to clear please refer to XXX spec page6-186*/

GPTSR = GPT_SR_CLRA;

c->event_handler(c);

} while (((signed long)(GPTA_ENDCOUNT - GPTA_COUNT) <= MIN_ADCOUNT_DELTA)

&& (c->mode == CLOCK_EVT_MODE_PERIODIC));

}

return IRQ_HANDLED;

}

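Focus on the two c->event_handler(c) calls above. In short, they run dev_id->event_handler. So where does dev_id come from? Recall the dev_id parameter from our earlier discussion of the request_irq interface. To make this point clearer, here is the request_irq source code again: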

int request_irq(unsigned int irq, irq_handler_t handler,

unsigned long irqflags, const char *devname, void *dev_id)

{

struct irqaction *action;

......

action->handler = handler;

action->flags = irqflags;

cpus_clear(action->mask);

action->name = devname;

action->next = NULL;

action->dev_id = dev_id;

retval = setup_irq(irq, action);

return retval;

}

From xxx_timer_init =>> setup_irq(IRQ_GPTA, &xxx_gpta_irq); we can see that this dev_id corresponds to:

static struct clock_event_device ckevt_xxx_gpta = {

.name = "ckevt_xxx_gpta",

.features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,

.shift = 32,

.rating = 200,

.cpumask = CPU_MASK_CPU0,

.set_next_event = xxx_gpta_set_next_event,

.set_mode = xxx_gpta_set_mode,

};

Taking handle_level_irq as an example: handle_level_irq =>> handle_IRQ_event =>> ret = action->handler(irq, action->dev_id); so this part should now be clear.

The problem is that we have not seen event_handler assigned anywhere. Where does it come from?

◆ xxx_timer_init =>> clockevents_register_device =>> clockevents_do_notify operates on clockevents_chain (defined in kernel/time/clockevents.c as static RAW_NOTIFIER_HEAD(clockevents_chain);).

◆ xxx_timer_init =>> clockevents_register_device =>> clockevents_do_notify(CLOCK_EVT_NOTIFY_ADD, dev); =>> raw_notifier_call_chain =>> __raw_notifier_call_chain =>> notifier_call_chain =>> ret = nb->notifier_call(nb, val, v);

Meanwhile tick_init =>> clockevents_register_notifier(&tick_notifier); =>> raw_notifier_chain_register adds the tick notifier to clockevents_chain; see raw_notifier_chain_register =>> notifier_chain_register.

Here:

static struct notifier_block tick_notifier = {

.notifier_call = tick_notify,

};

So notifier_call is tick_notify.

◆ tick_notify =>> tick_check_new_device =>> tick_setup_device: because we configured a periodic tick, it continues to =>> tick_setup_periodic(newdev, 0); =>> tick_set_periodic_handler =>> dev->event_handler = tick_handle_periodic;

Only now is it clear that c->event_handler(c) in 1.2.3.2 is tick_handle_periodic.
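
The wiring traced above can be condensed into a small user-space model: the driver registers a device without a handler, the framework's notifier assigns one, and the interrupt handler calls it through dev_id. All names below are illustrative stand-ins, not kernel APIs.

```c
#include <stddef.h>

struct clock_event_dev_sketch {
    const char *name;
    void (*event_handler)(struct clock_event_dev_sketch *);
};

typedef void (*notifier_fn)(struct clock_event_dev_sketch *);

static notifier_fn chain_notifier;   /* stands in for clockevents_chain */
static unsigned long tick_count;

/* Stands in for tick_handle_periodic. */
static void handle_periodic_sketch(struct clock_event_dev_sketch *dev)
{
    (void)dev;
    tick_count++;
}

/* Stands in for tick_notify: the framework chooses the handler. */
static void tick_notify_sketch(struct clock_event_dev_sketch *dev)
{
    dev->event_handler = handle_periodic_sketch;
}

/* Stands in for tick_init registering tick_notifier on the chain. */
void register_notifier_sketch(void)
{
    chain_notifier = tick_notify_sketch;
}

/* Stands in for clockevents_register_device: notify the chain. */
void register_device_sketch(struct clock_event_dev_sketch *dev)
{
    if (chain_notifier)
        chain_notifier(dev);
}

/* Stands in for the IRQ handler: dev_id is the registered device.
 * Returns 1 if a handler was called, 0 if none was assigned yet. */
int timer_interrupt_sketch(void *dev_id)
{
    struct clock_event_dev_sketch *c = dev_id;

    if (c->event_handler == NULL)
        return 0;
    c->event_handler(c);
    return 1;
}

unsigned long ticks_seen(void)
{
    return tick_count;
}
```

The driver never names the tick code directly, which is what allows the framework to switch the same device between periodic and one-shot handling.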

tick_handle_periodic =>> tick_periodic =>> do_timer:

1. Updates jiffies.

2. tick_handle_periodic =>> tick_periodic =>> do_timer =>> update_wall_time updates xtime.
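
The two steps can be modeled in a few lines. This is a simplified sketch assuming a fixed 10 ms tick and no NTP adjustment; the struct and function names are illustrative, and the real update_wall_time works from the clock source rather than a constant.

```c
#define TICK_HZ      100                        /* 10 ms per tick */
#define NSEC_PER_SEC 1000000000L
#define TICK_NSEC    (NSEC_PER_SEC / TICK_HZ)

struct wall_clock_sketch {
    unsigned long long jiffies;  /* ticks since boot */
    long sec;                    /* models xtime.tv_sec */
    long nsec;                   /* models xtime.tv_nsec */
};

/* Account for `ticks` elapsed ticks. */
void do_timer_sketch(struct wall_clock_sketch *wc, unsigned long ticks)
{
    while (ticks--) {
        wc->jiffies++;                 /* step 1: update jiffies */
        wc->nsec += TICK_NSEC;         /* step 2: advance wall time */
        if (wc->nsec >= NSEC_PER_SEC) {
            wc->nsec -= NSEC_PER_SEC;  /* carry into seconds */
            wc->sec++;
        }
    }
}
```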

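Discussion:

1. Can ticks be lost, and under what circumstances? What happens if they are? Ticks should never be lost; if they are, it is a bug in our own driver design.

2. Look at xxx_gpta_interrupt again and briefly explain why the do-while loop is needed.

3. Look at __get_realtime_clock_ts again: if we want high-resolution time, what is the simplest way to get it?

The current implementation is as follows (the resolution of clocksource_jiffies is only 10 ms):

◆ __get_realtime_clock_ts =>> __get_nsec_offset =>> clocksource_read

◆ timekeeping_init =>> clock = clocksource_get_next(); =>>

◆ init_jiffies_clocksource =>> clocksource_register(&clocksource_jiffies); =>>

If we want high-resolution time we can implement our own high-resolution clock source, but for simplicity this is often not done.

Since clocksource_register lets us register several clock sources, which one actually takes effect? It depends on three things:

1. Whether a clock source was selected through a Linux kernel init parameter:

See: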

static int __init boot_override_clocksource(char* str)

{

unsigned long flags;

spin_lock_irqsave(&clocksource_lock, flags);

if (str)

strlcpy(override_name, str, sizeof(override_name));

spin_unlock_irqrestore(&clocksource_lock, flags);

return 1;

}

__setup("clocksource=", boot_override_clocksource);

2. After the system has booted we can also select one through sysfs: write the name of the clocksource we want to /sys/devices/system/clocksource/clocksource0/current_clocksource. Interested readers can analyze how this file is created; briefly:

A) system_bus_init() creates /sys/devices/system/

B) init_clocksource_sysfs =>> int error = sysdev_class_register(&clocksource_sysclass); creates /sys/devices/system/clocksource/

C) init_clocksource_sysfs =>> error = sysdev_register(&device_clocksource); creates /sys/devices/system/clocksource/clocksource0/

D) init_clocksource_sysfs =>> error = sysdev_create_file(&device_clocksource,&attr_current_clocksource); creates /sys/devices/system/clocksource/clocksource0/current_clocksource

3. If neither of the first two steps made a selection, the choice depends on the rating of each clocksource: the one with the highest rating is used.
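
The selection rule described above (an explicit override by name first, otherwise the highest rating) can be sketched as follows. The struct, the function name, and the fallback to rating when the override name is unknown are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

struct clocksource_sketch {
    const char *name;
    int rating;
};

/*
 * Return the clock source to use:
 *  - the one named by override_name, if it is registered;
 *  - otherwise the registered one with the highest rating;
 *  - NULL if nothing is registered.
 */
const struct clocksource_sketch *
select_clocksource(const struct clocksource_sketch *cs, size_t n,
                   const char *override_name)
{
    const struct clocksource_sketch *best = NULL;
    size_t i;

    if (override_name && override_name[0]) {
        for (i = 0; i < n; i++)
            if (strcmp(cs[i].name, override_name) == 0)
                return &cs[i];
    }

    for (i = 0; i < n; i++)
        if (best == NULL || cs[i].rating > best->rating)
            best = &cs[i];
    return best;
}
```

Here override_name models what either the clocksource= boot parameter or a write to current_clocksource would supply.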

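1.2.4 How xtime is updated after waking from deep sleep

When the system wakes from deep sleep, rtc_resume updates xtime from the RTC time; please analyze the source code yourself. When rtc_resume is called will be covered in the power management analysis.

Discussion:

4. RTC accuracy and calibration.

5. The time accuracy of an application process and the real-time question: can the real-time requirement raised by R3 be met at the application-process level?

1.3 Other time subsystem architectures supported by Linux Kernel 2.6.23

We do not cover them here; they may be shared at a later opportunity.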

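1.4 Updating process-related time attributes

tick_handle_periodic =>> tick_periodic =>> update_process_times:

1. Updates process-related times and CPU usage state, such as the current process's utime (time executed in user mode) and stime (time executed in kernel mode), and the CPU state information (see account_system_time).

2. run_local_timers checks whether any timer has expired and, if so, runs its handler. This includes the hrtimer set up by setitimer() (raise_softirq(TIMER_SOFTIRQ);, which comes from init_timers =>> open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);. The ins and outs of softirqs are covered in detail in chapter 9).

3. scheduler_tick checks the process's time slice counter to see whether its time has run out; this belongs to the scheduler.

Discussion:

6. Why can we see how much CPU time a process has used, and the CPU utilization? For example, top or ps -aux shows the CPU time used by a process.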
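
One way to picture the answer: every tick is charged to whichever process is running, as user or system time, and a utilization figure is the share of charged ticks in a sampling window. The sketch below is a simplified model with illustrative names, not the kernel's accounting code.

```c
struct task_times_sketch {
    unsigned long utime;   /* ticks spent in user mode */
    unsigned long stime;   /* ticks spent in kernel mode */
};

/* Charge one tick to the running task. */
void account_tick(struct task_times_sketch *t, int user_mode)
{
    if (user_mode)
        t->utime++;
    else
        t->stime++;
}

/* CPU usage in percent between two samples taken elapsed_ticks apart. */
unsigned int cpu_percent(const struct task_times_sketch *before,
                         const struct task_times_sketch *after,
                         unsigned long elapsed_ticks)
{
    unsigned long used = (after->utime - before->utime) +
                         (after->stime - before->stime);

    if (elapsed_ticks == 0)
        return 0;
    return (unsigned int)(used * 100 / elapsed_ticks);
}
```

Because the charge is made once per tick, the figures are only as fine-grained as the tick period.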

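1.4.1 A brief look at Linux 2.6.23 process scheduling, starting from scheduler_tick

During the discussion of this chapter, some colleagues showed interest in Linux process scheduling. I had not planned to analyze what scheduler_tick does, but since it was raised, let us use scheduler_tick as the starting point for a brief look at the Linux 2.6.23 process scheduling mechanism.

Leaving the details aside, scheduler_tick mainly runs the handler of the current task's scheduler class: curr->sched_class->task_tick(rq, curr);

Linux 2.6.23 mainly supports two scheduler classes: the Real-Time Scheduling Class, which serves real-time processes (SCHED_FIFO or SCHED_RR), and the Completely Fair Scheduling (CFS) Class, which serves normal processes (SCHED_NORMAL) and batch processes (SCHED_BATCH).

The Real-Time Scheduling Class is defined in kernel/sched_rt.c: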

static struct sched_class rt_sched_class __read_mostly = {

......

.pick_next_task = pick_next_task_rt,

......

.task_tick = task_tick_rt,

};

The Completely Fair Scheduling (CFS) Class is defined in kernel/sched_fair.c:

struct sched_class fair_sched_class __read_mostly = {

......

.pick_next_task = pick_next_task_fair,

......

.task_tick = task_tick_fair,

......

};

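These correspond to two different scheduling policies.

1.4.1.1 Scheduling policy for real-time processes: task_tick_rt

1. If the current process has not used up its allocated time slice, it keeps running.

2. If the time slice is used up, a new time slice is allocated and the code checks whether there are other real-time processes. If there are, it requests a reschedule (set_tsk_need_resched(p); sets _TIF_NEED_RESCHED, which asks the system to reschedule before returning from kernel space to user space). If there is no other real-time process, the current real-time process keeps running.

3. When xxx_gpta_interrupt finishes handling the interrupt and is about to return to user space, process rescheduling takes place. The corresponding source code is:

In arch/arm/kernel/entry-armv.S: __irq_usr =>> ret_to_user. ret_to_user is defined in arch/arm/kernel/entry-common.S: ret_to_user (in effect ret_slow_syscall) =>> work_pending =>> work_resched =>> bl schedule.

Discussion:

7. sys_sched_setscheduler can turn a process into a real-time process and set its priority.

8. A careful reading of the task_tick_rt source code leads to the following conclusions:

A. A SCHED_FIFO real-time process has no time slice. Once it is scheduled it keeps the CPU until it exits, blocks, yields, or is preempted by a higher-priority real-time process. This matches its first-in-first-out definition.

B. A SCHED_RR real-time process, once scheduled, is rescheduled only when its time slice is used up and another process of the same priority exists. (I consider this a flaw.)

C. Given A and B, real-time processes must be used with great care; misuse can leave the system unable to work.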
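
The per-tick behavior summarized in A and B can be modeled as below. This is a sketch of the behavior as described in this section, not the kernel's task_tick_rt; the types, the default_slice field and the same_prio_peers parameter are illustrative assumptions.

```c
enum rt_policy_sketch { POLICY_FIFO, POLICY_RR };

struct rt_task_sketch {
    enum rt_policy_sketch policy;
    unsigned int time_slice;      /* ticks left in the current slice */
    unsigned int default_slice;   /* value used when refilling */
};

/*
 * Called once per tick for the running real-time task.
 * Returns 1 if a reschedule should be requested, 0 otherwise.
 */
int rt_tick_sketch(struct rt_task_sketch *p, int same_prio_peers)
{
    /* FIFO tasks have no time slice: the tick never preempts them. */
    if (p->policy != POLICY_RR)
        return 0;

    /* Slice not used up yet: keep running. */
    if (p->time_slice > 0 && --p->time_slice > 0)
        return 0;

    /* Slice used up: refill it, and give way only if a task of the
     * same priority is waiting. */
    p->time_slice = p->default_slice;
    return same_prio_peers > 0;
}
```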

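1.4.1.2 Scheduling policy for normal processes: task_tick_fair

task_tick_fair =>> entity_tick:

1. __pick_next_entity picks a normal process. The idea behind the CFS algorithm is simple: pick the process that has waited the longest. Whoever has waited longest runs first; this is the so-called completely fair scheduling.

2. __check_preempt_curr_fair =>> resched_task =>> set_tsk_need_resched

3. Same as step 3 in 1.4.1.1.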
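
The "longest waiter runs first" idea from step 1 can be illustrated with a linear scan. The real CFS keeps runnable entities in a red-black tree and takes the leftmost one, so the sketch below only models the selection rule; the struct and function names are illustrative.

```c
#include <stddef.h>

struct fair_task_sketch {
    int pid;
    unsigned long long wait_ns;   /* how long the task has been waiting */
};

/* Return the task that has waited the longest, or NULL if none. */
const struct fair_task_sketch *
pick_longest_waiting(const struct fair_task_sketch *tasks, size_t n)
{
    const struct fair_task_sketch *best = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (best == NULL || tasks[i].wait_ns > best->wait_ns)
            best = &tasks[i];
    return best;
}
```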

1.4.1.3 A brief look at schedule()

Whenever schedule() runs, it has to choose the next process to run: pick_next_task

1. It first tries to pick a real-time process, using pick_next_task_rt.

2. Only if no real-time process is found does it pick a normal process, using pick_next_task_fair.
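
This ordering amounts to walking the scheduler classes from highest to lowest priority and taking the first task offered. The model below is a sketch with illustrative names; each "run queue" is reduced to a single pointer.

```c
#include <stddef.h>

struct task_sketch {
    int pid;
};

struct sched_class_sketch {
    const char *name;
    struct task_sketch *(*pick_next_task)(void);
};

/* One-entry "run queues" for the two classes. */
static struct task_sketch *rt_queue_head;
static struct task_sketch *fair_queue_head;

static struct task_sketch *pick_next_rt_sketch(void)
{
    return rt_queue_head;
}

static struct task_sketch *pick_next_fair_sketch(void)
{
    return fair_queue_head;
}

/* Classes in priority order: real-time first, then fair. */
static const struct sched_class_sketch classes[] = {
    { "rt",   pick_next_rt_sketch },
    { "fair", pick_next_fair_sketch },
};

void set_queues(struct task_sketch *rt, struct task_sketch *fair)
{
    rt_queue_head = rt;
    fair_queue_head = fair;
}

/* Ask each class in turn; the first one with a runnable task wins. */
struct task_sketch *pick_next_task_sketch(void)
{
    size_t i;

    for (i = 0; i < sizeof classes / sizeof classes[0]; i++) {
        struct task_sketch *p = classes[i].pick_next_task();
        if (p)
            return p;
    }
    return NULL;
}
```

A consequence visible even in this model is that a runnable real-time task always wins over every normal task, which is the reason for the caution about real-time processes given earlier.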