Linux Power Management (4): CPUFreq

Source: Internet. Editor: 程序博客网. Time: 2024/04/28 19:00

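Introduction to CPUFreq

CPUFreq is a technique for adjusting CPU voltage and frequency at run time. It is also known as DVFS (Dynamic Voltage and Frequency Scaling).

Why CPUFreq Is Needed

As technology advances, CPU clock frequencies keep rising, performance keeps improving, and manufacturing processes keep getting more advanced. Higher performance also means more heat and higher power consumption. Mobile and embedded devices do not need peak performance all the time, so a mechanism is needed that adjusts frequency and voltage dynamically to balance performance against power consumption.

CPUFreq Software Framework

Like most Linux subsystems, CPUFreq separates mechanism from policy. It consists of three modules:

cpufreq core: encapsulates and abstracts the cpufreq governors and cpufreq drivers and defines clear interfaces between them, which is what separates mechanism from policy in the design.
cpufreq drivers: sit below the cpufreq core and program the frequency of the actual CPU hardware. A cpufreq driver registers and implements its frequency-scaling operations through the cpufreq_driver structure of the standard Linux cpufreq subsystem.
cpufreq governor: sits above the cpufreq core, monitors when the CPU should scale up or down, and decides the target frequency from the system state and load. A governor registers and implements its scaling policy through the cpufreq_governor structure of the Linux cpufreq subsystem.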

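How CPUFreq Works

Linux cpufreq works by registering a cpufreq driver and cpufreq governors with the system. The governor implements the scaling policy and the driver performs the actual change, and together they adjust frequency and voltage dynamically. The two are changed in a fixed order: in the sunxi target callback analyzed later, the voltage is raised before the frequency when scaling up, and lowered only after the frequency when scaling down.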
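The ordering of voltage and frequency changes can be sketched in user space. This is a minimal simulation modeled on the sunxi target callback shown later in this article, not kernel code: struct opp, struct dvfs_state, and dvfs_switch are hypothetical names, and the "hardware" is just a struct that logs the order of operations.

```c
/* Hypothetical operating point: frequency in kHz, voltage in mV. */
struct opp {
    unsigned int khz;
    unsigned int mv;
};

/* Simulated hardware state plus a log of the order of operations. */
struct dvfs_state {
    struct opp cur;
    char log[4];   /* 'V' = voltage step, 'F' = frequency step */
    int nlog;
};

static void set_voltage(struct dvfs_state *s, unsigned int mv)
{
    s->cur.mv = mv;
    s->log[s->nlog++] = 'V';
}

static void set_frequency(struct dvfs_state *s, unsigned int khz)
{
    s->cur.khz = khz;
    s->log[s->nlog++] = 'F';
}

/* Switch to a new operating point in a safe order. */
static void dvfs_switch(struct dvfs_state *s, struct opp next)
{
    s->nlog = 0;
    if (next.mv > s->cur.mv)
        set_voltage(s, next.mv);      /* scaling up: raise voltage first */
    if (next.khz != s->cur.khz)
        set_frequency(s, next.khz);
    if (next.mv < s->cur.mv)
        set_voltage(s, next.mv);      /* scaling down: lower voltage last */
    s->log[s->nlog] = '\0';
}
```

Raising the voltage first keeps the core within its rated voltage for the higher clock; lowering it last does the same on the way down.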

CPUFreq sysfs User-Space Interface

The cpufreq nodes are located under /sys/devices/system/cpu/cpu0/cpufreq:

$ cd /sys/devices/system/cpu/cpu0/cpufreq

The following nodes are visible:

shell@tiny4412:/sys/devices/system/cpu/cpu0/cpufreq # ls
affected_cpus
cpuinfo_cur_freq
cpuinfo_max_freq
cpuinfo_min_freq
cpuinfo_transition_latency
related_cpus
scaling_available_governors
scaling_cur_freq
scaling_driver
scaling_governor
scaling_max_freq
scaling_min_freq
scaling_setspeed
stats

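The meaning of each node is given in the following table:
[Table: meaning of each cpufreq sysfs node; image missing from the source]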
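These nodes are ordinary text files, so a program can read them like any other file. The following is a minimal user-space sketch, assuming the node holds a single unsigned integer (the frequency nodes report kHz); read_sysfs_ulong is a hypothetical helper name, not a kernel or libc API.

```c
#include <stdio.h>

/* Read one unsigned integer (for example a frequency in kHz) from a
 * sysfs-style text file. Returns 0 on success, -1 if the file cannot
 * be opened or does not start with a number. */
static int read_sysfs_ulong(const char *path, unsigned long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, calling it with "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq" would return the current frequency on a system where that node exists.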

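CPUFreq Implementation Analysis

CPUFreq Core Layer

The CPUFreq subsystem gathers the logic that is common to all platforms into the CPUFreq core. The core provides governors, drivers, and other kernel modules with the APIs needed to form a complete CPUFreq subsystem. This section analyzes the implementation and use of some important core-layer APIs.

Code location:

/drivers/cpufreq/cpufreq.c

CPUFreq Subsystem Initialization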

static int __init cpufreq_core_init(void)
{
    int cpu;

    if (cpufreq_disabled())
        return -ENODEV;

    for_each_possible_cpu(cpu) {
        per_cpu(cpufreq_policy_cpu, cpu) = -1;
        init_rwsem(&per_cpu(cpu_policy_rwsem, cpu));
    }

    cpufreq_global_kobject = kobject_create_and_add("cpufreq", &cpu_subsys.dev_root->kobj);
    BUG_ON(!cpufreq_global_kobject);

#if defined(CONFIG_ARCH_SUNXI) && defined(CONFIG_HOTPLUG_CPU)
    /* register reboot notifier for process cpus when reboot */
    register_reboot_notifier(&reboot_notifier);
#endif

    return 0;
}
core_initcall(cpufreq_core_init);

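As shown, the core of the CPUFreq subsystem is initialized during boot through the initcall mechanism. cpufreq_policy_cpu is a per-CPU variable: on an SMP system each CPU can have its own policy or share one with other CPUs. kobject_create_and_add creates a cpufreq node under the root of the cpu subsystem, that is /sys/devices/system/cpu/cpufreq, next to the per-CPU cpufreq directories seen earlier in sysfs. Other parameters are placed in this node later.
The cpu_subsys argument is a kernel global that is set up earlier during boot, in drivers/base/cpu.c: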

struct bus_type cpu_subsys = {
    .name = "cpu",
    .dev_name = "cpu",
};
EXPORT_SYMBOL_GPL(cpu_subsys);

void __init cpu_dev_init(void)
{
    if (subsys_system_register(&cpu_subsys, cpu_root_attr_groups))
        panic("Failed to register CPU subsystem");

    cpu_dev_register_generic();
}

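This creates a cpu bus with every CPU in the system attached to it. The root directory of the cpu bus devices is /sys/devices/system/cpu, and a cpu bus node also appears under /sys/bus. Nodes cpu0, cpu1, ... cpux appear under that root directory, one device node per CPU. The CPUFreq subsystem uses cpu_subsys to find the CPU devices in the system and creates the corresponding cpufreq objects under them, as discussed later.
So the initialization of the cpufreq subsystem does very little: it initializes a few per-CPU variables and creates one cpufreq node. The figure below is the sequence diagram of the initialization:
[Figure: sequence diagram of CPUFreq core initialization; image missing from the source]

Registering a cpufreq_governor

Several governors can coexist in the system, and a policy is tied to one of them through the governor pointer in struct cpufreq_policy. Before a policy can use a governor, the governor must be registered with the cpufreq core through this core-layer API: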

int cpufreq_register_governor(struct cpufreq_governor *governor)
{
    int err;

    if (!governor)
        return -EINVAL;

    if (cpufreq_disabled())
        return -ENODEV;

    mutex_lock(&cpufreq_governor_mutex);

    governor->initialized = 0;
    err = -EBUSY;
    if (__find_governor(governor->name) == NULL) {
        err = 0;
        list_add(&governor->governor_list, &cpufreq_governor_list);
    }

    mutex_unlock(&cpufreq_governor_mutex);
    return err;
}

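The core layer defines a global list, cpufreq_governor_list. The registration function first calls __find_governor() to look the governor up by name. If no governor of that name has been registered, the structure is added to cpufreq_governor_list; otherwise the function returns -EBUSY.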
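The register-by-name logic can be reproduced in a few lines of user-space C. This is a simplified sketch, not the kernel code: struct governor, find_governor, and register_governor are hypothetical stand-ins, a singly linked list replaces the kernel's list_head, and the mutex is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 16

/* Simplified stand-in for struct cpufreq_governor. */
struct governor {
    char name[NAME_LEN];
    struct governor *next;   /* link in the global list */
};

/* Plays the role of cpufreq_governor_list. */
static struct governor *governor_list;

/* Look a governor up by name; returns NULL if it is not registered. */
static struct governor *find_governor(const char *name)
{
    struct governor *g;

    for (g = governor_list; g; g = g->next)
        if (strncmp(g->name, name, NAME_LEN) == 0)
            return g;
    return NULL;
}

/* Returns 0 on success, -EINVAL for NULL, -EBUSY if the name is taken. */
static int register_governor(struct governor *gov)
{
    if (!gov)
        return -EINVAL;
    if (find_governor(gov->name))
        return -EBUSY;
    gov->next = governor_list;
    governor_list = gov;
    return 0;
}
```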

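Registering the cpufreq_driver

Unlike governors, only one cpufreq_driver exists in the system. The cpufreq_driver is platform specific and carries out the actual frequency change, while choosing the operating frequency is the governor's job. A single cpufreq_driver is therefore enough: it only needs to know how to control the platform's clock system and apply the frequency chosen by the governor. The core provides the cpufreq_register_driver API for registration.
Here is how that function works: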

int cpufreq_register_driver(struct cpufreq_driver *driver_data)
{
    unsigned long flags;
    int ret;

    if (cpufreq_disabled())
        return -ENODEV;

    /* verify and init are mandatory; at least one of setpolicy and target must also be implemented */
    if (!driver_data || !driver_data->verify || !driver_data->init ||
        ((!driver_data->setpolicy) && (!driver_data->target)))
        return -EINVAL;

    pr_debug("trying to register driver %s\n", driver_data->name);

    if (driver_data->setpolicy)
        driver_data->flags |= CPUFREQ_CONST_LOOPS;

    write_lock_irqsave(&cpufreq_driver_lock, flags);
    /* Only one driver may be registered: fail if the global cpufreq_driver
     * is already set, otherwise store driver_data in it */
    if (cpufreq_driver) {
        write_unlock_irqrestore(&cpufreq_driver_lock, flags);
        return -EBUSY;
    }
    cpufreq_driver = driver_data;
    write_unlock_irqrestore(&cpufreq_driver_lock, flags);

    /* subsys_interface_register() creates a cpufreq_policy for each CPU */
    ret = subsys_interface_register(&cpufreq_interface);
    if (ret)
        goto err_null_driver;

    if (!(cpufreq_driver->flags & CPUFREQ_STICKY)) {
        int i;
        ret = -ENODEV;

        /* check for at least one working CPU */
        for (i = 0; i < nr_cpu_ids; i++)
            if (cpu_possible(i) && per_cpu(cpufreq_cpu_data, i)) {
                ret = 0;
                break;
            }

        /* if all ->init() calls failed, unregister */
        if (ret) {
            pr_debug("no CPU initialized for driver %s\n",
                            driver_data->name);
            goto err_if_unreg;
        }
    }

    /* register for CPU hotplug notifications so the relationships between
     * CPU policies can be updated on hotplug (for example, migrating the
     * CPU that manages a policy) */
    register_hotcpu_notifier(&cpufreq_cpu_notifier);
    pr_debug("driver %s up and running\n", driver_data->name);

    return 0;

err_if_unreg:
    subsys_interface_unregister(&cpufreq_interface);
err_null_driver:
    write_lock_irqsave(&cpufreq_driver_lock, flags);
    cpufreq_driver = NULL;
    write_unlock_irqrestore(&cpufreq_driver_lock, flags);
    return ret;
}

The cpufreq_interface structure is as follows:

static struct subsys_interface cpufreq_interface = {
    .name       = "cpufreq",
    .subsys     = &cpu_subsys,
    .add_dev    = cpufreq_add_dev,
    .remove_dev = cpufreq_remove_dev,
};

subsys_interface_register walks every device under the subsystem and calls the add_dev callback of cpufreq_interface with that device as the argument. Here the callback points to cpufreq_add_dev.
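The registration rules described above (mandatory callbacks, a single driver slot, one init call per CPU, rollback if every CPU fails) can be sketched in user space. All names here (struct driver, register_driver, NR_CPUS, the demo_* callbacks) are hypothetical; locking, CPUFREQ_STICKY, and hotplug handling are left out.

```c
#include <errno.h>
#include <stddef.h>

#define NR_CPUS 4

struct policy {
    unsigned int cpu;
};

/* Simplified stand-in for struct cpufreq_driver. */
struct driver {
    int (*init)(struct policy *p);
    int (*verify)(struct policy *p);
    int (*setpolicy)(struct policy *p);
    int (*target)(struct policy *p, unsigned int khz);
};

static struct driver *current_driver;   /* only one driver system-wide */
static int cpu_initialized[NR_CPUS];

static int register_driver(struct driver *drv)
{
    unsigned int cpu;
    int any = 0;

    /* init and verify are mandatory; setpolicy or target must exist */
    if (!drv || !drv->init || !drv->verify ||
        (!drv->setpolicy && !drv->target))
        return -EINVAL;
    if (current_driver)
        return -EBUSY;
    current_driver = drv;

    /* mirrors add_dev being invoked once per CPU device */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        struct policy p = { .cpu = cpu };

        cpu_initialized[cpu] = (drv->init(&p) == 0);
        any |= cpu_initialized[cpu];
    }
    if (!any) {                 /* every init() failed: undo registration */
        current_driver = NULL;
        return -ENODEV;
    }
    return 0;
}

/* Hypothetical callbacks used only to exercise the sketch. */
static int demo_init(struct policy *p)      { (void)p; return 0; }
static int demo_init_fail(struct policy *p) { (void)p; return -ENODEV; }
static int demo_verify(struct policy *p)    { (void)p; return 0; }
static int demo_target(struct policy *p, unsigned int khz)
{
    (void)p;
    (void)khz;
    return 0;
}
```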

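The figure below is the sequence diagram of cpufreq_driver registration:
[Figure: sequence diagram of cpufreq_driver registration; image missing from the source]

Finally, __cpufreq_set_policy puts the policy into effect. At this point every CPU's policy has been created and is in operation.

The sequence diagram of __cpufreq_set_policy is as follows:

[Figure: sequence diagram of __cpufreq_set_policy; image missing from the source]

Other APIs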

int cpufreq_register_notifier(struct notifier_block *nb, unsigned int list);
int cpufreq_unregister_notifier(struct notifier_block *nb, unsigned int list);

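These two APIs register and unregister notifiers for cpufreq events. The second argument selects the notification type, which is one of:

CPUFREQ_TRANSITION_NOTIFIER: notified of frequency changes

CPUFREQ_POLICY_NOTIFIER: notified of policy updates

__cpufreq_driver_target: sets the target frequency; it ends up calling the target callback of the cpufreq_driver.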

int __cpufreq_driver_target(struct cpufreq_policy *policy,
                unsigned int target_freq,
                unsigned int relation)
{
    int retval = -EINVAL;
    unsigned int old_target_freq = target_freq;

    if (cpufreq_disabled())
        return -ENODEV;

    /* Make sure that target_freq is within supported range */
    if (target_freq > policy->max)
        target_freq = policy->max;
    if (target_freq < policy->min)
        target_freq = policy->min;

    pr_debug("target for CPU %u: %u kHz, relation %u, requested %u kHz\n",
            policy->cpu, target_freq, relation, old_target_freq);

    if (target_freq == policy->cur)
        return 0;

    if (cpufreq_driver->target)
        retval = cpufreq_driver->target(policy, target_freq, relation);

    return retval;
}
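The two checks in this function (clamp the request into [policy->min, policy->max], then skip the hardware call if nothing changes) are easy to try in isolation. The following user-space sketch uses hypothetical names (struct policy, hw_target, driver_target); hw_target stands in for the driver's target callback and only records what it was asked to do.

```c
/* Simplified policy: all frequencies in kHz. */
struct policy {
    unsigned int min, max, cur;
};

/* Hypothetical driver hook: records the requested frequency. */
static unsigned int last_hw_request;

static int hw_target(struct policy *p, unsigned int khz)
{
    last_hw_request = khz;
    p->cur = khz;
    return 0;
}

static int driver_target(struct policy *p, unsigned int target_khz)
{
    /* keep the request inside the policy limits */
    if (target_khz > p->max)
        target_khz = p->max;
    if (target_khz < p->min)
        target_khz = p->min;

    if (target_khz == p->cur)
        return 0;               /* already there: do not touch hardware */

    return hw_target(p, target_khz);
}
```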

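CPUFreq Driver Layer

What a driver engineer usually has to implement is the cpufreq driver, because this is the part that differs between CPUs. The cpufreq driver handles the platform-specific control of CPU frequency and voltage. Within the cpufreq framework it is a very simple module: define a struct cpufreq_driver variable, fill in the required fields, implement the callbacks according to the platform's characteristics, and register it with the system.
The cpufreq_driver structure is shown below.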

struct cpufreq_driver {
    struct module           *owner;  /* usually THIS_MODULE */
    char            name[CPUFREQ_NAME_LEN]; /* driver name, e.g. "cpufreq-sunxi" */
    u8          flags; /* e.g. CPUFREQ_STICKY: keep the driver registered even if every init() call fails */
    bool            have_governor_per_policy;

    /* needed by all drivers */
    int (*init)     (struct cpufreq_policy *policy); /* mandatory; run by the cpufreq core after a CPU device is added */
    int (*verify)   (struct cpufreq_policy *policy); /* mandatory; called when upper layers set a new policy, to check that it is valid */

    /* define one out of two */
    int (*setpolicy)    (struct cpufreq_policy *policy); /* usually not implemented */
    int (*target)   (struct cpufreq_policy *policy, /* performs the actual frequency change */
                 unsigned int target_freq,
                 unsigned int relation);

    /* should be defined, if possible */
    unsigned int    (*get)  (unsigned int cpu); /* returns the frequency of the given CPU */

    /* optional */
    unsigned int (*getavg)  (struct cpufreq_policy *policy,
                 unsigned int cpu);
    int (*bios_limit)   (int cpu, unsigned int *limit);

    int (*exit)     (struct cpufreq_policy *policy);
    int (*suspend)  (struct cpufreq_policy *policy);
    int (*resume)   (struct cpufreq_policy *policy);
    struct freq_attr    **attr;
};

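The following example fills in the required members of the cpufreq_driver structure.

static struct cpufreq_driver sunxi_cpufreq_driver = {
    .name   = "cpufreq-sunxi",
    .flags  = CPUFREQ_STICKY,
    .init   = sunxi_cpufreq_init,
    .verify = sunxi_cpufreq_verify,
    .target = sunxi_cpufreq_target,
    .get    = sunxi_cpufreq_get,
    .attr   = sunxi_cpufreq_attr,
};

First look at initialization. The driver's initcall, sunxi_cpufreq_initcall, reads its configuration from the device tree, obtains the clocks and the regulator, and sets up the maximum and minimum frequencies.
The device tree configuration is: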

cpu@0 {
    device_type = "cpu";
    compatible = "arm,cortex-a53","arm,armv8";
    reg = <0x0 0x0>;
    enable-method = "psci";
    cpufreq_tbl = < 480000
            648000
            720000
            816000
            912000
            1008000
            1104000
            1152000
            1200000>;
    clock-latency = <2000000>;
    clock-frequency = <1008000000>;
    cpu-idle-states = <&CPU_SLEEP_0 &CLUSTER_SLEEP_0 &SYS_SLEEP_0>;
};

The initialization function is:

static int __init sunxi_cpufreq_initcall(void)
{
    struct device_node *np;
    const struct property *prop;
    struct cpufreq_frequency_table *freq_tbl;
    const __be32 *val;
    int ret, cnt, i;

    np = of_find_node_by_path("/cpus/cpu@0");
    if (!np) {
        CPUFREQ_ERR("No cpu node found\n");
        return -ENODEV;
    }

    if (of_property_read_u32(np, "clock-latency",
                    &sunxi_cpufreq.transition_latency))
        sunxi_cpufreq.transition_latency = CPUFREQ_ETERNAL;

    prop = of_find_property(np, "cpufreq_tbl", NULL);
    if (!prop || !prop->value) {
        CPUFREQ_ERR("Invalid cpufreq_tbl\n");
        ret = -ENODEV;
        goto out_put_node;
    }

    cnt = prop->length / sizeof(u32);
    val = prop->value;

    freq_tbl = kmalloc(sizeof(*freq_tbl) * (cnt + 1), GFP_KERNEL);
    if (!freq_tbl) {
        ret = -ENOMEM;
        goto out_put_node;
    }

    for (i = 0; i < cnt; i++) {
        freq_tbl[i].index = i;
        freq_tbl[i].frequency = be32_to_cpup(val++);
    }

    freq_tbl[i].index = i;
    freq_tbl[i].frequency = CPUFREQ_TABLE_END;

    sunxi_cpufreq.freq_table = freq_tbl;

#ifdef CONFIG_DEBUG_FS
    sunxi_cpufreq.cpufreq_set_us = 0;
    sunxi_cpufreq.cpufreq_get_us = 0;
#endif

    sunxi_cpufreq.last_freq = ~0;

    sunxi_cpufreq.clk_pll = clk_get(NULL, PLL_CPU_CLK);
    if (IS_ERR_OR_NULL(sunxi_cpufreq.clk_pll)) {
        CPUFREQ_ERR("Unable to get PLL CPU clock\n");
        ret = PTR_ERR(sunxi_cpufreq.clk_pll);
        goto out_err_clk_pll;
    }

    sunxi_cpufreq.clk_cpu = clk_get(NULL, CPU_CLK);
    if (IS_ERR_OR_NULL(sunxi_cpufreq.clk_cpu)) {
        CPUFREQ_ERR("Unable to get CPU clock\n");
        ret = PTR_ERR(sunxi_cpufreq.clk_cpu);
        goto out_err_clk_cpu;
    }

    sunxi_cpufreq.vdd_cpu = regulator_get(NULL, CPU_VDD);
    if (IS_ERR_OR_NULL(sunxi_cpufreq.vdd_cpu)) {
        CPUFREQ_ERR("Unable to get CPU regulator\n");
        ret = PTR_ERR(sunxi_cpufreq.vdd_cpu);
        /* do not return error even if error*/
    }

    /* init cpu frequency from dt */
    ret = __init_freq_dt();
    if (ret == -ENODEV
#ifdef CONFIG_CPU_VOLTAGE_SCALING
        || ret == -EINVAL
#endif
    )
        goto out_err_dt;

    pr_debug("[cpufreq] max: %uMHz, min: %uMHz, ext: %uMHz, boot: %uMHz\n",
                sunxi_cpufreq.max_freq / 1000, sunxi_cpufreq.min_freq / 1000,
                sunxi_cpufreq.ext_freq / 1000, sunxi_cpufreq.boot_freq / 1000);

#ifdef CONFIG_CPU_VOLTAGE_SCALING
    __vftable_show();
    sunxi_cpufreq.last_vdd = sunxi_cpufreq_getvolt();
#endif

    mutex_init(&sunxi_cpufreq.lock);

    ret = cpufreq_register_driver(&sunxi_cpufreq_driver);
    if (ret) {
        CPUFREQ_ERR("failed register driver\n");
        goto out_err_register;
    } else {
        goto out_put_node;
    }

out_err_register:
    mutex_destroy(&sunxi_cpufreq.lock);
out_err_dt:
    if (!IS_ERR_OR_NULL(sunxi_cpufreq.vdd_cpu)) {
        regulator_put(sunxi_cpufreq.vdd_cpu);
    }
    clk_put(sunxi_cpufreq.clk_cpu);
out_err_clk_cpu:
    clk_put(sunxi_cpufreq.clk_pll);
out_err_clk_pll:
    kfree(freq_tbl);
out_put_node:
    of_node_put(np);
    return ret;
}

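As shown above, the initialization function mainly obtains resources, reads the device tree, configures the maximum and minimum frequencies, and then registers the cpufreq driver.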
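The table-building loop in that function can be tried outside the kernel. This is a minimal sketch under stated assumptions: the property is passed in as raw bytes (device tree cells are stored as big-endian 32-bit values), struct freq_entry, be32_decode, and build_table are hypothetical names, and TABLE_END is an arbitrary sentinel that plays the role of CPUFREQ_TABLE_END.

```c
#include <stdint.h>
#include <stdlib.h>

/* Arbitrary sentinel for this sketch; stands in for CPUFREQ_TABLE_END. */
#define TABLE_END (~0u)

struct freq_entry {
    unsigned int index;
    unsigned int khz;
};

/* Decode one big-endian 32-bit cell. */
static uint32_t be32_decode(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Build a sentinel-terminated table from raw property bytes.
 * Returns NULL on allocation failure; the caller frees the result. */
static struct freq_entry *build_table(const unsigned char *prop, size_t len)
{
    size_t cnt = len / 4, i;
    struct freq_entry *tbl = malloc((cnt + 1) * sizeof(*tbl));

    if (!tbl)
        return NULL;
    for (i = 0; i < cnt; i++) {
        tbl[i].index = (unsigned int)i;
        tbl[i].khz = be32_decode(prop + 4 * i);
    }
    /* one extra entry marks the end, so no separate length is needed */
    tbl[cnt].index = (unsigned int)cnt;
    tbl[cnt].khz = TABLE_END;
    return tbl;
}
```

The terminating entry is why the kernel code allocates cnt + 1 elements: code that walks the table stops at the sentinel instead of carrying a count around.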

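Next is the verify callback, which simply calls cpufreq_frequency_table_verify. That function makes sure there is at least one valid frequency between policy->min and policy->max and that all other criteria are met.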

static int sunxi_cpufreq_verify(struct cpufreq_policy *policy)
{
    return cpufreq_frequency_table_verify(policy, sunxi_cpufreq.freq_table);
}
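The core of that check, "is there at least one table frequency inside [min, max]", can be sketched as follows. This is a deliberately simplified user-space version with hypothetical names (struct policy, table_has_valid_freq, TABLE_END); it only reports success or failure and does not adjust the policy limits.

```c
#include <errno.h>

/* Arbitrary sentinel for this sketch; marks the end of the table. */
#define TABLE_END (~0u)

struct policy {
    unsigned int min, max;   /* kHz */
};

/* Returns 0 if some table frequency lies in [min, max], else -EINVAL. */
static int table_has_valid_freq(const struct policy *p,
                                const unsigned int *table_khz)
{
    if (p->min > p->max)
        return -EINVAL;

    for (; *table_khz != TABLE_END; table_khz++)
        if (*table_khz >= p->min && *table_khz <= p->max)
            return 0;

    return -EINVAL;
}
```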

The get callback returns the current CPU frequency.

static unsigned int sunxi_cpufreq_get(unsigned int cpu)
{
    unsigned int current_freq = 0;
#ifdef CONFIG_DEBUG_FS
    ktime_t calltime = ktime_get();
#endif

    clk_get_rate(sunxi_cpufreq.clk_pll);
    current_freq = clk_get_rate(sunxi_cpufreq.clk_cpu) / 1000;

#ifdef CONFIG_DEBUG_FS
    sunxi_cpufreq.cpufreq_get_us = ktime_to_us(ktime_sub(ktime_get(), calltime));
#endif

    return current_freq;
}

The target callback is what actually performs frequency and voltage scaling.

static int sunxi_cpufreq_target(struct cpufreq_policy *policy,
                    __u32 freq, __u32 relation)
{
    int ret = 0;
    unsigned int            index;
    struct cpufreq_freqs    freqs;
#ifdef CONFIG_DEBUG_FS
    ktime_t calltime;
#endif
#ifdef CONFIG_SMP
    int i;
#endif
#ifdef CONFIG_CPU_VOLTAGE_SCALING
    unsigned int new_vdd;
#endif

    mutex_lock(&sunxi_cpufreq.lock);

    /* avoid repeated calls which cause a needless amount of duplicated
     * logging output (and CPU time as the calculation process is
     * done) */
    if (freq == sunxi_cpufreq.last_freq)
        goto out;

    CPUFREQ_DBG(DEBUG_FREQ, "request frequency is %uKHz\n", freq);

    if (unlikely(sunxi_boot_lock))
        freq = freq > sunxi_cpufreq.boot_freq ? sunxi_cpufreq.boot_freq : freq;

    /* try to look for a valid frequency value from cpu frequency table */
    if (cpufreq_frequency_table_target(policy, sunxi_cpufreq.freq_table,
                    freq, relation, &index)) {
        CPUFREQ_ERR("try to look for %uKHz failed!\n", freq);
        ret = -EINVAL;
        goto out;
    }

    /* frequency is same as the value last set, need not adjust */
    if (sunxi_cpufreq.freq_table[index].frequency == sunxi_cpufreq.last_freq)
        goto out;

    freq = sunxi_cpufreq.freq_table[index].frequency;

    CPUFREQ_DBG(DEBUG_FREQ, "target is find: %uKHz, entry %u\n", freq, index);

    /* notify that cpu clock will be adjust if needed */
    if (policy) {
        freqs.cpu = policy->cpu;
        freqs.old = sunxi_cpufreq.last_freq;
        freqs.new = freq;
#ifdef CONFIG_SMP
        /* notifiers */
        for_each_cpu(i, policy->cpus) {
            freqs.cpu = i;
            cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
        }
#else
        cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
#endif
    }

#ifdef CONFIG_CPU_VOLTAGE_SCALING
    /* get vdd value for new frequency */
    new_vdd = __get_vdd_value(freq * 1000);
    CPUFREQ_DBG(DEBUG_FREQ, "set cpu vdd to %dmv\n", new_vdd);

    /* scaling up: raise the voltage before the frequency */
    if (!IS_ERR_OR_NULL(sunxi_cpufreq.vdd_cpu) && (new_vdd > sunxi_cpufreq.last_vdd)) {
        CPUFREQ_DBG(DEBUG_FREQ, "set cpu vdd to %dmv\n", new_vdd);
        if (regulator_set_voltage(sunxi_cpufreq.vdd_cpu, new_vdd*1000, new_vdd*1000)) {
            CPUFREQ_ERR("try to set cpu vdd failed!\n");

            /* notify everyone that clock transition finish */
            if (policy) {
                freqs.cpu = policy->cpu;
                freqs.old = freqs.new;
                freqs.new = sunxi_cpufreq.last_freq;
#ifdef CONFIG_SMP
                /* notifiers */
                for_each_cpu(i, policy->cpus) {
                    freqs.cpu = i;
                    cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
                }
#else
                cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
#endif
            }
            /* note: this path returns with sunxi_cpufreq.lock still held */
            return -EINVAL;
        }
    }
#endif

#ifdef CONFIG_DEBUG_FS
    calltime = ktime_get();
#endif

    /* try to set cpu frequency */
#ifndef CONFIG_SUNXI_ARISC
    if (__set_cpufreq_by_ccu(freq))
#else
    if (arisc_dvfs_set_cpufreq(freq, ARISC_DVFS_PLL1, ARISC_DVFS_SYN, NULL, NULL))
#endif
    {
        CPUFREQ_ERR("set cpu frequency to %uKHz failed!\n", freq);

#ifdef CONFIG_CPU_VOLTAGE_SCALING
        if (!IS_ERR_OR_NULL(sunxi_cpufreq.vdd_cpu) && (new_vdd > sunxi_cpufreq.last_vdd)) {
            if (regulator_set_voltage(sunxi_cpufreq.vdd_cpu,
                    sunxi_cpufreq.last_vdd*1000, sunxi_cpufreq.last_vdd*1000)) {
                CPUFREQ_ERR("try to set voltage failed!\n");
                sunxi_cpufreq.last_vdd = new_vdd;
            }
        }
#endif

        /* set cpu frequency failed */
        if (policy) {
            freqs.cpu = policy->cpu;
            freqs.old = freqs.new;
            freqs.new = sunxi_cpufreq.last_freq;
#ifdef CONFIG_SMP
            /* notifiers */
            for_each_cpu(i, policy->cpus) {
                freqs.cpu = i;
                cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
            }
#else
            cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
#endif
        }

        ret = -EINVAL;
        goto out;
    }

#ifdef CONFIG_DEBUG_FS
    sunxi_cpufreq.cpufreq_set_us = ktime_to_us(ktime_sub(ktime_get(), calltime));
#endif

#ifdef CONFIG_CPU_VOLTAGE_SCALING
    /* scaling down: lower the voltage only after the frequency */
    if(sunxi_cpufreq.vdd_cpu && (new_vdd < sunxi_cpufreq.last_vdd)) {
        CPUFREQ_DBG(DEBUG_FREQ, "set cpu vdd to %dmv\n", new_vdd);
        if(regulator_set_voltage(sunxi_cpufreq.vdd_cpu, new_vdd*1000, new_vdd*1000)) {
            CPUFREQ_ERR("try to set voltage failed!\n");
            new_vdd = sunxi_cpufreq.last_vdd;
        }
    }
    sunxi_cpufreq.last_vdd = new_vdd;
#endif

    /* notify that cpu clock will be adjust if needed */
    if (policy) {
#ifdef CONFIG_SMP
        for_each_cpu(i, policy->cpus) {
            freqs.cpu = i;
            cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
        }
#else
        cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
#endif
    }

    sunxi_cpufreq.last_freq = freq;

    CPUFREQ_DBG(DEBUG_FREQ, "DVFS done! Freq[%uMHz] Volt[%umv] ok\n",
            sunxi_cpufreq_get(0) / 1000, sunxi_cpufreq_getvolt());

out:
    mutex_unlock(&sunxi_cpufreq.lock);

    return ret;
}

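The code is fairly easy to follow and is not analyzed further here. The flow chart is:
[Figure: flow chart of sunxi_cpufreq_target; image missing from the source]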
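One step of the target callback that benefits from an example is the table lookup with a relation argument. In cpufreq, CPUFREQ_RELATION_L selects the lowest frequency at or above the target and CPUFREQ_RELATION_H the highest frequency at or below it. The sketch below is a simplified user-space stand-in, not the kernel's cpufreq_frequency_table_target: the names are hypothetical, policy limits are ignored, and it returns -1 when no entry qualifies.

```c
/* Arbitrary sentinel for this sketch; marks the end of the table. */
#define TABLE_END (~0u)

enum relation {
    RELATION_L,   /* lowest frequency at or above the target */
    RELATION_H    /* highest frequency at or below the target */
};

/* Returns the index of the chosen entry, or -1 if none qualifies. */
static int table_target(const unsigned int *khz, unsigned int target,
                        enum relation rel)
{
    int best = -1, i;

    for (i = 0; khz[i] != TABLE_END; i++) {
        if (rel == RELATION_L && khz[i] >= target &&
            (best < 0 || khz[i] < khz[best]))
            best = i;
        if (rel == RELATION_H && khz[i] <= target &&
            (best < 0 || khz[i] > khz[best]))
            best = i;
    }
    return best;
}
```

The table does not need to be sorted, because each candidate is compared against the best entry found so far.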

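CPUFreq Governor Layer

As mentioned above, a governor monitors the system load, picks one of the available operating frequencies based on that load, and passes it to the cpufreq_driver, which completes the dynamic frequency change. The kernel described here provides five governors:
- Performance: favors performance and sets the CPU frequency directly to the maximum of policy->{min,max}. It is often chosen as the default governor to shorten boot time, with a switch to another governor afterwards.
- Powersave: favors low power and sets the CPU frequency directly to the minimum of policy->{min,max}.
- Userspace: a user-space program sets the frequency through the scaling_setspeed file. Mostly used for debugging.
- Ondemand: adjusts the CPU frequency dynamically according to current CPU utilization.
- Interactive: adjusts the CPU frequency dynamically with an emphasis on interactivity. It is similar to Ondemand, was developed by Google, and is widely used on phones, tablets, and similar devices. This article focuses on this governor.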
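The three static governors in this list reduce to a one-line decision each. The following user-space sketch illustrates that; enum gov_kind, struct policy, and governor_pick are hypothetical names, and the userspace case assumes the requested speed is clamped to the policy limits.

```c
enum gov_kind {
    GOV_PERFORMANCE,
    GOV_POWERSAVE,
    GOV_USERSPACE
};

struct policy {
    unsigned int min, max;   /* kHz */
};

/* Pick a target frequency; setspeed_khz is only used by GOV_USERSPACE. */
static unsigned int governor_pick(enum gov_kind kind, const struct policy *p,
                                  unsigned int setspeed_khz)
{
    switch (kind) {
    case GOV_PERFORMANCE:
        return p->max;               /* always the upper limit */
    case GOV_POWERSAVE:
        return p->min;               /* always the lower limit */
    case GOV_USERSPACE:
        if (setspeed_khz > p->max)   /* user value, kept inside limits */
            return p->max;
        if (setspeed_khz < p->min)
            return p->min;
        return setspeed_khz;
    }
    return p->max;
}
```

Ondemand and Interactive differ from these because their decision depends on measured load over time, which is why they need timers and tunables.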
Here is the cpufreq_governor structure:

struct cpufreq_governor {
    char    name[CPUFREQ_NAME_LEN]; /* governor name, set to "interactive" here */
    int initialized; /* initialization flag */
    int (*governor) (struct cpufreq_policy *policy,
                 unsigned int event);  /* this callback controls the governor's behavior; it is the governor's entry point */
    ssize_t (*show_setspeed)    (struct cpufreq_policy *policy,
                     char *buf);
    int (*store_setspeed)   (struct cpufreq_policy *policy,
                     unsigned int freq);
    unsigned int max_transition_latency; /* HW must be able to switch to
            next freq faster than this value in nano secs or we
            will fallback to performance governor */
    struct list_head    governor_list; /* every registered governor is added to this list */
    struct module       *owner;
};

A governor is defined like this:

struct cpufreq_governor cpufreq_gov_interactive = {
    .name = "interactive",
    .governor = cpufreq_governor_interactive,
    .max_transition_latency = 10000000,
    .owner = THIS_MODULE,
};

governor is the key field of this structure. Once the cpufreq_governor is registered, the cpufreq core drives the governor through this field: initialization, start, exit, and so on.

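How is a governor initialized?
When a policy selects a governor, the core layer configures that CPU's policy through __cpufreq_set_policy. If the policy sees that this is a new governor (different from the one previously in use), it calls __cpufreq_governor with the CPUFREQ_GOV_POLICY_INIT argument, and __cpufreq_governor in turn calls the governor callback in the cpufreq_governor structure.
Below is the code that handles CPUFREQ_GOV_POLICY_INIT: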

    case CPUFREQ_GOV_POLICY_INIT:
        if (have_governor_per_policy()) {
            WARN_ON(tunables);
        } else if (tunables) {
            tunables->usage_count++;
            policy->governor_data = tunables;
            return 0;
        }

        tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
        if (!tunables) {
            pr_err("%s: POLICY_INIT: kzalloc failed\n", __func__);
            return -ENOMEM;
        }

        tunables->usage_count = 1;
        tunables->io_is_busy = true;
        tunables->above_hispeed_delay = default_above_hispeed_delay;
        tunables->nabove_hispeed_delay =
            ARRAY_SIZE(default_above_hispeed_delay);
        tunables->go_hispeed_load = DEFAULT_GO_HISPEED_LOAD;
        tunables->target_loads = default_target_loads;
        tunables->ntarget_loads = ARRAY_SIZE(default_target_loads);
        tunables->min_sample_time = DEFAULT_MIN_SAMPLE_TIME;
        tunables->timer_rate = DEFAULT_TIMER_RATE;
        tunables->boostpulse_duration_val = DEFAULT_MIN_SAMPLE_TIME;
        tunables->timer_slack_val = DEFAULT_TIMER_SLACK;

        spin_lock_init(&tunables->target_loads_lock);
        spin_lock_init(&tunables->above_hispeed_delay_lock);

        policy->governor_data = tunables;
        if (!have_governor_per_policy())
            common_tunables = tunables;

        rc = sysfs_create_group(get_governor_parent_kobj(policy),
                get_sysfs_attr());
        if (rc) {
            kfree(tunables);
            policy->governor_data = NULL;
            if (!have_governor_per_policy())
                common_tunables = NULL;
            return rc;
        }

        if (!policy->governor->initialized) {
            idle_notifier_register(&cpufreq_interactive_idle_nb);
            cpufreq_register_notifier(&cpufreq_notifier_block,
                    CPUFREQ_TRANSITION_NOTIFIER);
        }

#ifdef CONFIG_CPU_FREQ_INPUT_EVNT_NOTIFY
        if (!input_handler_register_count) {
            cpumask_clear(&interactive_cpumask);
            rc = input_register_handler(
                    &cpufreq_interactive_input_handler);
            if (rc)
                return rc;
        }
        tunables->input_event_freq = policy->max *
                DEFAULT_INPUT_EVENT_FRFQ_PERCENT / 100;
        tunables->input_dev_monitor = true;
        input_handler_register_count++;
#endif
        break;

The sequence diagram is:

[Figure: sequence diagram of governor initialization; image missing from the source]
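A detail worth isolating from the POLICY_INIT code is the reference counting of the shared tunables: the first policy allocates them, later policies only bump usage_count, and the last exit frees them. The following user-space sketch covers only the case where tunables are shared by all policies; struct tunables, gov_policy_init, and gov_policy_exit are hypothetical names, and the default value is a placeholder for the sketch.

```c
#include <stdlib.h>

struct tunables {
    int usage_count;
    unsigned int go_hispeed_load;
};

struct policy {
    struct tunables *governor_data;
};

/* Shared by all policies, like common_tunables in the governor. */
static struct tunables *common_tunables;

/* Returns 0 on success, -1 if allocation fails. */
static int gov_policy_init(struct policy *p)
{
    if (common_tunables) {
        /* already set up by another policy: just take a reference */
        common_tunables->usage_count++;
        p->governor_data = common_tunables;
        return 0;
    }

    common_tunables = calloc(1, sizeof(*common_tunables));
    if (!common_tunables)
        return -1;
    common_tunables->usage_count = 1;
    common_tunables->go_hispeed_load = 99;   /* placeholder default */
    p->governor_data = common_tunables;
    return 0;
}

static void gov_policy_exit(struct policy *p)
{
    /* the last user frees the shared tunables */
    if (--common_tunables->usage_count == 0) {
        free(common_tunables);
        common_tunables = NULL;
    }
    p->governor_data = NULL;
}
```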

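After sysfs_create_group, the corresponding sysfs nodes exist under /sys/devices/system/cpu/cpufreq/interactive. The main nodes are:

boost: how interactive handles bursts of work.
boostpulse: how long the frequency stays raised when a burst is being handled.
go_hispeed_load: the high-speed load threshold. When the system load exceeds this value the frequency is raised, otherwise it is lowered.
hispeed_freq: the frequency to jump to when the workload reaches go_hispeed_load.
input_boost: burst handling for input events such as touches.
min_sample_time: the minimum sample time. Each frequency decision must be held for at least this long.
timer_rate: the sampling rate of the sampling timer.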

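When the CPU is not idle, timer_rate is the sampling rate used to compute the CPU workload. When the CPU is idle, a deferrable timer is used, which does not wake the CPU from idle just to service the timer. The maximum time the timer may be deferred is given by timer_slack, 80000 us by default.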
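How go_hispeed_load, hispeed_freq, and min_sample_time interact can be sketched with a small decision function. This is a heavily simplified illustration and not the real interactive governor algorithm: the names are hypothetical, the below-threshold case simply scales the maximum frequency by the load, and target_loads, above_hispeed_delay, and boosting are ignored.

```c
struct tunables {
    unsigned int go_hispeed_load;    /* percent */
    unsigned int hispeed_freq;       /* kHz */
    unsigned long min_sample_time;   /* microseconds */
};

struct cpu_state {
    unsigned int cur_khz;
    unsigned long last_change_us;    /* when cur_khz was last changed */
};

/* One sampling step: returns the frequency in effect afterwards. */
static unsigned int interactive_pick(const struct tunables *t,
                                     struct cpu_state *s,
                                     unsigned int load_pct,
                                     unsigned int max_khz,
                                     unsigned long now_us)
{
    unsigned int next;

    if (load_pct >= t->go_hispeed_load)
        /* busy: jump to hispeed_freq first, then on to the maximum */
        next = s->cur_khz < t->hispeed_freq ? t->hispeed_freq : max_khz;
    else
        /* not busy: scale the maximum by the load (simplification) */
        next = (unsigned int)((unsigned long long)max_khz * load_pct / 100);

    /* hold the current frequency for min_sample_time before lowering it */
    if (next < s->cur_khz &&
        now_us - s->last_change_us < t->min_sample_time)
        return s->cur_khz;

    if (next != s->cur_khz) {
        s->cur_khz = next;
        s->last_change_us = now_us;
    }
    return s->cur_khz;
}
```

Raising the frequency is immediate while lowering it is delayed, which is what makes the governor responsive to bursts without oscillating.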
How is a governor started?
As with initialization, the governor callback is invoked, this time with the CPUFREQ_GOV_START event:

    case CPUFREQ_GOV_START:
        mutex_lock(&gov_lock);

        freq_table = cpufreq_frequency_get_table(policy->cpu);
        /* default hispeed_freq to policy->max if it has not been set */
        if (!tunables->hispeed_freq)
            tunables->hispeed_freq = policy->max;

        /* iterate over the online CPUs covered by this policy */
        for_each_cpu(j, policy->cpus) {
            pcpu = &per_cpu(cpuinfo, j);
            pcpu->policy = policy;
            pcpu->target_freq = policy->cur;
            pcpu->freq_table = freq_table;
            pcpu->floor_freq = pcpu->target_freq;
            pcpu->floor_validate_time =
                ktime_to_us(ktime_get());
            pcpu->hispeed_validate_time =
                pcpu->floor_validate_time;
            pcpu->max_freq = policy->max;
            down_write(&pcpu->enable_sem);
            del_timer_sync(&pcpu->cpu_timer);
            del_timer_sync(&pcpu->cpu_slack_timer);
            /* start the timers */
            cpufreq_interactive_timer_start(tunables, j);
            /* once the timers run the governor is operational, so set governor_enabled to 1 */
            pcpu->governor_enabled = 1;
            up_write(&pcpu->enable_sem);
        }

        mutex_unlock(&gov_lock);
        break;

The governor field is set to cpufreq_governor_interactive. Here is its full implementation:

static int cpufreq_governor_interactive(struct cpufreq_policy *policy,
        unsigned int event)
{
    int rc;
    unsigned int j;
    struct cpufreq_interactive_cpuinfo *pcpu;
    struct cpufreq_frequency_table *freq_table;
    struct cpufreq_interactive_tunables *tunables;
    unsigned long flags;

    if (have_governor_per_policy())
        tunables = policy->governor_data;
    else
        tunables = common_tunables;

    WARN_ON(!tunables && (event != CPUFREQ_GOV_POLICY_INIT));

    switch (event) {
    case CPUFREQ_GOV_POLICY_INIT:
        if (have_governor_per_policy()) {
            WARN_ON(tunables);
        } else if (tunables) {
            tunables->usage_count++;
            policy->governor_data = tunables;
            return 0;
        }

        tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
        if (!tunables) {
            pr_err("%s: POLICY_INIT: kzalloc failed\n", __func__);
            return -ENOMEM;
        }

        tunables->usage_count = 1;
        tunables->io_is_busy = true;
        tunables->above_hispeed_delay = default_above_hispeed_delay;
        tunables->nabove_hispeed_delay =
            ARRAY_SIZE(default_above_hispeed_delay);
        tunables->go_hispeed_load = DEFAULT_GO_HISPEED_LOAD;
        tunables->target_loads = default_target_loads;
        tunables->ntarget_loads = ARRAY_SIZE(default_target_loads);
        tunables->min_sample_time = DEFAULT_MIN_SAMPLE_TIME;
        tunables->timer_rate = DEFAULT_TIMER_RATE;
        tunables->boostpulse_duration_val = DEFAULT_MIN_SAMPLE_TIME;
        tunables->timer_slack_val = DEFAULT_TIMER_SLACK;

        spin_lock_init(&tunables->target_loads_lock);
        spin_lock_init(&tunables->above_hispeed_delay_lock);

        policy->governor_data = tunables;
        if (!have_governor_per_policy())
            common_tunables = tunables;

        rc = sysfs_create_group(get_governor_parent_kobj(policy),
                get_sysfs_attr());
        if (rc) {
            kfree(tunables);
            policy->governor_data = NULL;
            if (!have_governor_per_policy())
                common_tunables = NULL;
            return rc;
        }

        if (!policy->governor->initialized) {
            idle_notifier_register(&cpufreq_interactive_idle_nb);
            cpufreq_register_notifier(&cpufreq_notifier_block,
                    CPUFREQ_TRANSITION_NOTIFIER);
        }

#ifdef CONFIG_CPU_FREQ_INPUT_EVNT_NOTIFY
        if (!input_handler_register_count) {
            cpumask_clear(&interactive_cpumask);
            rc = input_register_handler(
                    &cpufreq_interactive_input_handler);
            if (rc)
                return rc;
        }
        tunables->input_event_freq = policy->max *
                DEFAULT_INPUT_EVENT_FRFQ_PERCENT / 100;
        tunables->input_dev_monitor = true;
        input_handler_register_count++;
#endif
        break;

    case CPUFREQ_GOV_POLICY_EXIT:
        if (!--tunables->usage_count) {
            if (policy->governor->initialized == 1) {
                cpufreq_unregister_notifier(&cpufreq_notifier_block,
                        CPUFREQ_TRANSITION_NOTIFIER);
                idle_notifier_unregister(&cpufreq_interactive_idle_nb);
            }

            sysfs_remove_group(get_governor_parent_kobj(policy),
                    get_sysfs_attr());
            kfree(tunables);
            common_tunables = NULL;
        }

#ifdef CONFIG_CPU_FREQ_INPUT_EVNT_NOTIFY
        if (input_handler_register_count > 0)
            input_handler_register_count--;
        if (!input_handler_register_count) {
            cpumask_clear(&interactive_cpumask);
            input_unregister_handler(&cpufreq_interactive_input_handler);
        }
#endif
        policy->governor_data = NULL;
        break;

    case CPUFREQ_GOV_START:
        mutex_lock(&gov_lock);

        freq_table = cpufreq_frequency_get_table(policy->cpu);
        if (!tunables->hispeed_freq)
            tunables->hispeed_freq = policy->max;

        for_each_cpu(j, policy->cpus) {
            pcpu = &per_cpu(cpuinfo, j);
            pcpu->policy = policy;
            pcpu->target_freq = policy->cur;
            pcpu->freq_table = freq_table;
            pcpu->floor_freq = pcpu->target_freq;
            pcpu->floor_validate_time =
                ktime_to_us(ktime_get());
            pcpu->hispeed_validate_time =
                pcpu->floor_validate_time;
            pcpu->max_freq = policy->max;
            down_write(&pcpu->enable_sem);
            del_timer_sync(&pcpu->cpu_timer);
            del_timer_sync(&pcpu->cpu_slack_timer);
            cpufreq_interactive_timer_start(tunables, j);
            pcpu->governor_enabled = 1;
            up_write(&pcpu->enable_sem);
        }

#ifdef CONFIG_CPU_FREQ_INPUT_EVNT_NOTIFY
        cpumask_or(&interactive_cpumask, &interactive_cpumask, policy->cpus);
#endif
        mutex_unlock(&gov_lock);
        break;

    case CPUFREQ_GOV_STOP:
        mutex_lock(&gov_lock);
        for_each_cpu(j, policy->cpus) {
            pcpu = &per_cpu(cpuinfo, j);
            down_write(&pcpu->enable_sem);
            pcpu->governor_enabled = 0;
            del_timer_sync(&pcpu->cpu_timer);
            del_timer_sync(&pcpu->cpu_slack_timer);
            up_write(&pcpu->enable_sem);
        }

#ifdef CONFIG_CPU_FREQ_INPUT_EVNT_NOTIFY
        cpumask_andnot(&interactive_cpumask, &interactive_cpumask, policy->cpus);
#endif
        mutex_unlock(&gov_lock);
        break;

    case CPUFREQ_GOV_LIMITS:
        if (policy->max < policy->cur)
            __cpufreq_driver_target(policy,
                    policy->max, CPUFREQ_RELATION_H);
        else if (policy->min > policy->cur)
            __cpufreq_driver_target(policy,
                    policy->min, CPUFREQ_RELATION_L);

        for_each_cpu(j, policy->cpus) {
            pcpu = &per_cpu(cpuinfo, j);

            down_read(&pcpu->enable_sem);
            if (pcpu->governor_enabled == 0) {
                up_read(&pcpu->enable_sem);
                continue;
            }

            spin_lock_irqsave(&pcpu->target_freq_lock, flags);
            if (policy->max < pcpu->target_freq)
                pcpu->target_freq = policy->max;
            else if (policy->min > pcpu->target_freq)
                pcpu->target_freq = policy->min;
            spin_unlock_irqrestore(&pcpu->target_freq_lock, flags);
            up_read(&pcpu->enable_sem);

            /* Reschedule timer only if policy->max is raised.
             * Delete the timers, else the timer callback may
             * return without re-arm the timer when failed
             * acquire the semaphore. This race may cause timer
             * stopped unexpectedly.
             */
            if (policy->max > pcpu->max_freq) {
                down_write(&pcpu->enable_sem);
                del_timer_sync(&pcpu->cpu_timer);
                del_timer_sync(&pcpu->cpu_slack_timer);
                cpufreq_interactive_timer_start(tunables, j);
                up_write(&pcpu->enable_sem);
            }

            pcpu->max_freq = policy->max;
        }
        break;
    }
    return 0;
}

This function mainly sets up two timers, cpufreq_interactive_timer and cpufreq_interactive_nop_timer.
The key part is the implementation of the cpufreq_interactive_timer timer.
