[知其然不知其所以然-37] All about RTC

来源:互联网 发布:流体力学计算软件 编辑:程序博客网 时间:2024/05/17 16:45

RTC, is a concept of devices, which stands for Real Time Clock,

so many devices can be used as RTC.

In principle, all these RTC devices are controlled by RTC class in the linux kernel,

which is :

/sys/class/rtc/

and the related code is at:

drivers/rtc/class.c

so there are many different RTC devices need to fill the callbacks

required by RTC class structure, and they are

all independent of the arch, which are all under:

drivers/rtc/


What we care most is the cmos rtc driver, that is, 

drivers/rtc/rtc-cmos.c

This rtc cmos driver is based on the fact that, there is acpi rtc device in the

system:

static const struct pnp_device_id rtc_ids[] = {        { .id = "PNP0b00", },        { .id = "PNP0b01", },        { .id = "PNP0b02", },        { },};

otherwise, it prove the of_match_table in the device tree(mainly used by ARM),

to check if there is any node named:

"motorola,mc146818"
then probe the rtc cmos driver for this device.

After driver is loaded, this RTC class driver, it leverages CMOS_READ/CMOS_WRITE

to satisfy the requirement of RTC class framework, and CMOS RTC is such type

of RTC device that relies on the CMOS(a kind of SRAM which retain its data by

a button battery on th south bridge). And for different architectures, the CMOS_READ/WRITE

are different according to their different io address.

In order to make the life easier, a new wrapped driver called mc146818rtc(made by

Motorola) is used to achieve CMOS_READ/WRITE.

But the code is still ugly that some other parts of code use the CMOS_READ/WRITE

directly, such as mach_get_cmos_time used by persistent_clock, and hpet handler also

uses cmos_read_time without leverage rtc class interface.


Util this point, we are mainly talking about the time/date role  of the RTC provides to the OS.

Actually, the RTC can also be used as an interrupt generator, for example the alarm timer.

Actually RTC support 3 kinds of interrupts, they are periodic timer,alarm interrupt, and update interrupt

respectively. I just want to mention the RTC update interrupt a little more, in theory, there is an

internal 1HZ interrupt which will update the current RTC value to the UTC(increased by 1 every second),

during the update cycle, it is not allowed for the user to read the value of RTC, so two solutions

are leveraged to avoid the update cycle, 1st is the update-end interrupt(AKA update interrupt above),

once the update finished, this interrupt will be sent to CPU to indicate the safety; 2nd is polling at

UIP bit of RTC register(which stands for update-in-process), if this flag is set, users are not allowed to

access RTC value.

OK, let's take a look at the most frequently used interrupt- alarm interrupt,AKA rtcwake,

You can set up an alarm timer by simply configurating the RTC register and give

a timeout value into the register, then once the expire time reaches, an interrupt of 

IRQ 8 will be generated, and if the rtc-cmos driver has registered an IRQ handler for it,

the handler will do some clean work to clear the RTC registers, and then  invoke the 

timeout handler which leverages rtc_update_irq to try_to_wakeup the pending task who

is blocking at the timer event.

Util now, everything works fine. But the disadvantage of RTC is that, it has only one instance in

the system, and meanwhile its frequency is a little slow. So a more flexible

device has been introduced to replace the RTC, that is hpet, which stands for High Precision Event Timer,

it is up to 10MHz, with a maximum of 256 instance(256 Compare registers). So the Linux tries to leverage

hpet to replace the RTC, and the hpet has a interesting feature, that is , if the hpet is working in legacy-replace-mode,

the hpet timer[1] will generate IRQ 8 rather than IRQ 0  on hpet timer[0], see:

<pre name="code" class="cpp">http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf1 LEG_RT_CNF LegacyReplacement Route:• 0 – Doesn’t support LegacyReplacement Route• 1 – Supports LegacyReplacement RouteIf the ENABLE_CN If the ENABLE_CNF bit and the LEG_RT_CNF bit are both set, then the interrupts will berouted as follows:Timer 0 will be routed to IRQ0 in Non-APIC or IRQ2 in the I/O APIC Timer 1 will be routed to IRQ8 in Non-APIC or IRQ8 in the I/O APIC Timer 2-n will be routed as per the routing in the timer n config registers.If the LegacyReplacement Route bit is set, the individual routing bits for timers 0 and 1(APIC or FSB) will have no impact.If the LegacyReplacement Route bit is not set, the individual routing bits for each of thetimers are used.
So it shows that, if the legacy replace mode of hpet is enabled,irq 8 previously used by rtc will be eaten by hpet. So we should leverage hpet interrupt handler to poll for rtc handling event.Then hpet_rtc_interrupt will take care of the rtc time out and decide whetherto invoke the actual RTC handler when the virtual rtc timer reaches a timeout.

<pre name="code" class="cpp">irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id){ if (hpet_rtc_flags & RTC_AIE &&            (curr_time.tm_sec == hpet_alarm_time.tm_sec) &&            (curr_time.tm_min == hpet_alarm_time.tm_min) &&            (curr_time.tm_hour == hpet_alarm_time.tm_hour))                        rtc_int_flag |= RTC_AF; if (irq_handler)                        irq_handler(rtc_int_flag, dev_id);}

As we already talked something about hpet, let's go a little further ofhpet. The hpet can be both treated as clocksource and a clockevent device,During early bootup, the call of hpet_enable will try to register bothhpet clocksource and clockevent by:hpet_clocksource_register and  hpet_legacy_clockevent_register respectively.
The clocksource can be found at:
<pre name="code" class="cpp">cat /sys/devices/system/clocksource/clocksource0/available_clocksourcetsc hpet acpi_pm
while the hpet clockevent is defined as
static struct clock_event_device hpet_clockevent = {        .name                   = "hpet",        .features               = CLOCK_EVT_FEAT_PERIODIC |                                  CLOCK_EVT_FEAT_ONESHOT,        .set_state_periodic     = hpet_legacy_set_periodic,        .set_state_oneshot      = hpet_legacy_set_oneshot,        .set_state_shutdown     = hpet_legacy_shutdown,        .tick_resume            = hpet_legacy_resume,        .set_next_event         = hpet_legacy_next_event,        .irq                    = 0,        .rating                 = 50,};
However since the rate of hpet is too low, the system per cpuclockevent is replaced by higher priority lapic clock event:<pre name="code" class="cpp">grep . /sys/devices/system/clockevents/clockevent*/current_device/sys/devices/system/clockevents/clockevent0/current_device:lapic/sys/devices/system/clockevents/clockevent1/current_device:lapic/sys/devices/system/clockevents/clockevent2/current_device:lapic/sys/devices/system/clockevents/clockevent3/current_device:lapic
Also from above defination we can see the hpet clockevent willtrigger IRQ 0 on hpet timer[0], and this is coherent withwhat we see in /proc/interrupts.


Here's an example:

It was reported that hibernation could fail on the 2nd attempt, where the
    system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
    claim_dma_lock(), because the lock has already been taken.
    
    However there is actually no other process would like to grab this lock on
    that problematic platform.
    
    Further investigation showed that the problem is triggered by setting
    /sys/power/pm_trace to 1 before the 1st hibernation.
    
    Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
    and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
    than 1970) to the release date of that motherboard during POST stage, thus
    after resumed, it may seem that the system had a significant long sleep time
    which is a completely meaningless value.
    
    Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
    sleep time happened to be set to 1, fls() returns 32 and we add 1 to
    sleep_time_bin[32], which causes an out of bounds array access and therefor
    memory being overwritten.
    
    As depicted by System.map:
    0xffffffff81c9d080 b sleep_time_bin
    0xffffffff81c9d100 B dma_spin_lock
    the dma_spin_lock.val is set to 1, which caused this problem.

while Thomas replied that:

"

While the range check is certainly correct and a good thing to have it's wrongin the first place to call __timekeeping_inject_sleeptime() in case thatpm_trace is enabled simply because that "hash" time value will also wreckagetimekeeping. Your patch is just curing the symptom in the debug code but notfixing the root cause.
"

So I need to find a way also fix this issue in timekeeping subsystem. so I asked Thomas, which solution

he likes:

I've done more testings on that problematical machine, and the result shows that,the real offender is that, some BIOSes would like to adjust the 'INVALID' rtc valueto the motherboard released date during POST stage. That is to say, once the rtcis set to an invalid value, the subsequent hibernate/resume would get a meaninglesstsc delta due to the aggressive BIOSes.Here's one of scenarios on how the problem is triggered:1. pm_trace = 1, suspend to disk sets the rtc to year 1912.2. system reboots, BIOS adjusts the rtc to 2013, which is the release date of that motherboard.3. resume from disk, and the sleep time(i.e, delta of tsc) is (2013 - 1912), which caused overflow   in timekeepting_debug.We can avoid this problem by ignoring the idle injection if pm_trace is enabled,but we might still miss other cases which might also break the rtc, e.g, buggyclocksoure/rtc driver, or even user space tool such as hwclock, so actually wecan not distinguish whether the delta of tsc is reasonable(long time sleep or not?), IMO.I've prepared two version of solutions:1st is to bypass the sleep time if pm_trace is involved(a little complicated    because it needs to deal with historical pm_trace), or2nd version is to introduce a sysfs interface to allow the    users to specify the threshold of sleep time, any delta    of tsc bigger than the threshold would be regarded as invalid.Both two solutions contain the fix for the overflow in timekeeping_debug.May I know which one do you prefer, or any suggestions would be appreciated.

Thus Thomas really doesn't want to overkill, thus he suggest solution 1:

That one is really overkill. We provide a proper sanity measure for this andbe done with it.

Notice, Thomas really wants a simple solution.

here's version 2:

So this patch ignores the sleep time if pm_trace is enabled inthe following situation:1. rtc is used as persist clock to compensate for sleep time,   (because system does not have a nonstop clocksource) or2. rtc is used to calculate the sleep time in rtc_resume.
Because there are mainly two parts in the code would inject the sleep time ,

OK. I've modified the patch.In case I break any other stuff :p, could you help checkif this patch is in the right direction, thanks:1.  There are two places would invoke __timekeeping_inject_sleeptime(),    they are timekeeping_resume and rtc_resume, so we need to deal with    them respctively.2.  for rtc_resume, if the pm_trace has once been enabled,    we bypass the injection of sleep time.3. for timekeeping_resume,     Currently we either use nonstop clock source, or use persistent     clock to get the sleep time. As pm_trace breaks systems who use rtc     as a persistent clock, x86 is affected. So we add a     check for x86 that, if the pm_trace has been enabled, we can not     trust the persistent clock delta read from rtc, thus bypass     the injection of sleep time in this case.4. Why we checked the history of pm_trace: once pm_trace   has been enabled, the delta of rtc would not be reliable anymore.   For example, if we only check current pm_trace, we might still get   memory overwrite:   4.1 echo 1 > /sys/power/pm_trace   4.2 hibernate/resume (rtc is broken, do not add delta from rtc because pm_trace is 1)   4.3 echo 0 > /sys/power/pm_trace   4.4 hibernate/resume (rtc is still broken, but add delta from rtc because pm_trace is 0)   so we have to check if pm_trace has once been enabled, if it is, we will not add   any delta from tsc until system reboots.

here's the patch:

Index: linux/kernel/time/timekeeping.c===================================================================--- linux.orig/kernel/time/timekeeping.c+++ linux/kernel/time/timekeeping.c@@ -1448,6 +1448,11 @@  void __weak read_boot_clock64(struct tim ts->tv_nsec = 0; } +bool __weak persistent_clock_is_usable(void)+{+return true;+}+ /* Flag for if timekeeping_resume() has injected sleeptime */ static bool sleeptime_injected; @@ -1609,6 +1614,7 @@  void timekeeping_resume(void) unsigned long flags; struct timespec64 ts_new, ts_delta; cycle_t cycle_now, cycle_delta;+bool persist_clock_usable = true;  sleeptime_injected = false; read_persistent_clock64(&ts_new);@@ -1660,9 +1666,11 @@  void timekeeping_resume(void) } else if (timespec64_compare(&ts_new, &timekeeping_suspend_time) > 0) { ts_delta = timespec64_sub(ts_new, timekeeping_suspend_time); sleeptime_injected = true;+if (!persistent_clock_is_usable())+persist_clock_usable = false; } -if (sleeptime_injected)+if (sleeptime_injected && persist_clock_usable) __timekeeping_inject_sleeptime(tk, &ts_delta);  /* Re-base the last cycle value */Index: linux/arch/x86/kernel/rtc.c===================================================================--- linux.orig/arch/x86/kernel/rtc.c+++ linux/arch/x86/kernel/rtc.c@@ -8,6 +8,7 @@  #include <linux/export.h> #include <linux/pnp.h> #include <linux/of.h>+#include <linux/pm-trace.h>  #include <asm/vsyscall.h> #include <asm/x86_init.h>@@ -147,6 +148,10 @@  void read_persistent_clock(struct timesp x86_platform.get_wallclock(ts); } +bool persistent_clock_is_usable(void)+{+return !pm_trace_once_enabled();+}  static struct resource rtc_resources[] = { [0] = {Index: linux/drivers/rtc/class.c===================================================================--- linux.orig/drivers/rtc/class.c+++ linux/drivers/rtc/class.c@@ -20,6 +20,7 @@  #include <linux/idr.h> #include <linux/slab.h> #include <linux/workqueue.h>+#include <linux/pm-trace.h>  #include "rtc-core.h" @@ -138,7 +139,7 @@  static int rtc_resume(struct device *dev sleep_time = timespec64_sub(sleep_time, timespec64_sub(new_system, old_system)); -if (sleep_time.tv_sec >= 0)+if ((sleep_time.tv_sec >= 0) && (!pm_trace_once_enabled()) ) timekeeping_inject_sleeptime64(&sleep_time); rtc_hctosys_ret = 0; return 0;Index: linux/kernel/power/main.c===================================================================--- linux.orig/kernel/power/main.c+++ linux/kernel/power/main.c@@ -532,6 +532,7 @@  power_attr(wake_unlock);  #ifdef CONFIG_PM_TRACE int pm_trace_enabled;+bool pm_trace_been_enabled;  static ssize_t pm_trace_show(struct kobject *kobj, struct kobj_attribute *attr,      char *buf)@@ -548,6 +549,7 @@  pm_trace_store(struct kobject *kobj, str if (sscanf(buf, "%d", &val) == 1) { pm_trace_enabled = !!val; if (pm_trace_enabled) {+pm_trace_been_enabled = true; pr_warn("PM: Enabling pm_trace changes system date and time during resume.\n" "PM: Correct system time has to be restored manually after resume.\n"); }Index: linux/include/linux/pm-trace.h===================================================================--- linux.orig/include/linux/pm-trace.h+++ linux/include/linux/pm-trace.h@@ -6,12 +6,18 @@  #include <linux/types.h>  extern int pm_trace_enabled;+extern bool pm_trace_been_enabled;  static inline int pm_trace_is_enabled(void) {        return pm_trace_enabled; } +static inline bool pm_trace_once_enabled(void)+{+       return pm_trace_been_enabled;+}+ struct device; extern void set_trace_device(struct device *); extern void generate_pm_trace(const void *tracedata, unsigned int user);@@ -25,6 +31,7 @@  extern int show_trace_dev_match(char *bu #else  static inline int pm_trace_is_enabled(void) { return 0; }+static inline bool pm_trace_once_enabled(void) { return false; }  #define TRACE_DEVICE(dev) do { } while (0) #define TRACE_RESUME(dev) do { } while (0)
Then Rafael has some comments on pm_trace history:

>    4.1 echo 1 > /sys/power/pm_trace>    4.2 hibernate/resume (rtc is broken, do not add delta from rtc because pm_trace is 1)>    4.3 echo 0 > /sys/power/pm_trace>    4.4 hibernate/resume (rtc is still broken, but add delta from rtc because pm_trace is 0)The initial state of the RTC is invalid, but will the delta be still invalid?And what if the admin fixes up the RTC before hibernating?  You will still discardthe RTC delta until the next reboot, right?
> And what if the admin fixes up the RTC before hibernating?  You will still> discard the RTC delta until the next reboot, right?Yes, it will be discarded, I agree we should not bypass the delta if someone hasfixed the rtc, I'll dig into the code.


Then Xunlei jumped out and said the weak function is not acceptible, and points out that there is a bug

in previous patch, that is , not all the x86 platforms

use cmos rtc as the persistent clock source, thus we should not

only check pm_trace in persistent_clock_is_usable, and must also

check the get_wallclock is assigned with rtc_cmos.

>

> +bool __weak persistent_clock_is_usable(void)> +{> +return true;> +}> +I suddenly think of a way to avoid adding this ugly __weak auxiliary function.Add a special treatment for read_persistent_clock() in arch/x86/kernel/rtc.c as follows,void read_persistent_clock(struct timespec *ts){    x86_platform.get_wallclock(ts);    /* Make rtc-based persistent clock unusable if pm_trace is enabled. */    if (pm_trace_is_enabled() &&        x86_platform.get_wallclock == mach_get_cmos_time) {        ts->tv_sec = 0;        ts->tv_nsec = 0;    }}In this way, we can avoid the touch of timekeeping core, after all ptrace is currently x86-specific.What do you think?

But  after modification, Rafael does not like this idea,

> + */> +if (pm_trace_is_enabled() &&> +    x86_platform.get_wallclock == mach_get_cmos_time) {> +ts->tv_sec = 0;> +ts->tv_nsec = 0;> +}I'm not sure about this.  Looks hackish.

The same as Thomas:

> I'm not sure about this.  Looks hackish.Indeed. Can't you just keep track that pm_trace fiddled with the cmos clockand then discard the value either in the core or in mach_get_cmos_time()

Actually I don't know what they mean, thus I mentioned, there is a previous patch that

deal with delta directly:

> Indeed. Can't you just keep track that pm_trace fiddled with the cmos clock> and then discard the value either in the core or in mach_get_cmos_time()The previous version is more straightforward, sinceit ignored the bogus rtc in core. Would you please takea glance at it too, thanks:https://patchwork.kernel.org/patch/9287347/
But actually above patch also uses x86, as Thomas said:

> https://patchwork.kernel.org/patch/9287347/This is the same hackery just different:> +bool persistent_clock_is_usable(void)> +{> +/* Unusable if pm_trace is enabled. */> +return !((x86_platform.get_wallclock == mach_get_cmos_time) &&> +        pm_trace_is_enabled());> +}I really have no idea why this is burried in x86 land. The pm_trace hackeryissues mc146818_set_time() to fiddle with the RTC. So any implementation ofthis is affected.So that very piece of pmtrace code should keep track of the wreckage it didto the RTC and provide the fact to the core timekeeping code which can thenskip the update.

unfortunately I did not get what he means, he just doesn't like the x86 part, and wants to get rid of it,

what I thought was that Thomas might wants to add the quirk in mc1468_set_time,

but I did not find a way to completely get rid of it, thus another ugly patch comes out:

 I really have no idea why this is burried in x86 land. The pm_trace hackery> issues mc146818_set_time() to fiddle with the RTC. So any implementation of> this is affected.OK, I've changed this patch according to this suggestion.Previously I tried to deal with the case that x86 usesRTC-CMOS as its presistent clock, not kvm_clock, norother get_wallclock, and this seems only be related to x86.So this piece of code looks hackish.> > So that very piece of pmtrace code should keep track of the wreckage it did> to the RTC and provide the fact to the core timekeeping code which can then> skip the update.> OK. I've moved most of the logic into the pm_trace component,Once the mc146818_set_time has modified the RTC by pm_trace,related flags will be set which indicates the unusable of RTC.And timekeeping system is able to query these flags to decide whetherit should inject the sleep time. (We tried to make this patch assimple as possible, but it looks like we have to deal with persistentclock for x86, which makes this patch a little more complicated).Here's the trial version of it, any suggestion would be appreciated:

Index: linux/drivers/base/power/trace.c===================================================================--- linux.orig/drivers/base/power/trace.c+++ linux/drivers/base/power/trace.c@@ -75,6 +75,27 @@  #define DEVSEED (7919)  static unsigned int dev_hash_value;+unsigned int timekeeping_tainted;++/* Is the persistent clock effected by pm_trace? */+int __weak arch_pm_trace_taint_pclock(void)+{+return 0;+}++void pm_trace_untaint_timekeeping(void)+{+timekeeping_tainted = 0;+}++void pm_trace_taint_timekeeping(void)+{+if (pm_trace_is_enabled()) {+timekeeping_tainted |= TIMEKEEPING_RTC_TAINTED;+if (arch_pm_trace_taint_pclock())+timekeeping_tainted |= TIMEKEEPING_PERSISTENT_CLOCK_TAINTED;+}+}  static int set_magic_time(unsigned int user, unsigned int file, unsigned int device) {@@ -104,6 +125,7 @@  static int set_magic_time(unsigned int u time.tm_min = (n % 20) * 3; n /= 20; mc146818_set_time(&time);+pm_trace_taint_timekeeping(); return n ? -1 : 0; } Index: linux/kernel/power/main.c===================================================================--- linux.orig/kernel/power/main.c+++ linux/kernel/power/main.c@@ -548,6 +548,7 @@  pm_trace_store(struct kobject *kobj, str if (sscanf(buf, "%d", &val) == 1) { pm_trace_enabled = !!val; if (pm_trace_enabled) {+pm_trace_untaint_timekeeping(); pr_warn("PM: Enabling pm_trace changes system date and time during resume.\n" "PM: Correct system time has to be restored manually after resume.\n"); }Index: linux/include/linux/pm-trace.h===================================================================--- linux.orig/include/linux/pm-trace.h+++ linux/include/linux/pm-trace.h@@ -1,10 +1,14 @@  #ifndef PM_TRACE_H #define PM_TRACE_H +#define TIMEKEEPING_RTC_TAINTED 0x1+#define TIMEKEEPING_PERSISTENT_CLOCK_TAINTED 0x2+ #ifdef CONFIG_PM_TRACE #include <asm/pm-trace.h> #include <linux/types.h> +extern unsigned int timekeeping_tainted; extern int pm_trace_enabled;  static inline int pm_trace_is_enabled(void)@@ -12,10 +16,23 @@  static inline int pm_trace_is_enabled(vo        return pm_trace_enabled; } +static inline int pm_trace_rtc_is_tainted(void)+{+return (timekeeping_tainted & TIMEKEEPING_RTC_TAINTED) ?+1 : 0;+}++static inline int pm_trace_pclock_is_tainted(void)+{+return (timekeeping_tainted & TIMEKEEPING_PERSISTENT_CLOCK_TAINTED) ?+1 : 0;+}+ struct device; extern void set_trace_device(struct device *); extern void generate_pm_trace(const void *tracedata, unsigned int user); extern int show_trace_dev_match(char *buf, size_t size);+extern void pm_trace_untaint_timekeeping(void);  #define TRACE_DEVICE(dev) do { \ if (pm_trace_enabled) \@@ -25,6 +42,8 @@  extern int show_trace_dev_match(char *bu #else  static inline int pm_trace_is_enabled(void) { return 0; }+static inline int pm_trace_rtc_is_tainted(void) { return 0; }+static inline int pm_trace_pclock_is_tainted(void) { return 0; }  #define TRACE_DEVICE(dev) do { } while (0) #define TRACE_RESUME(dev) do { } while (0)Index: linux/arch/x86/kernel/rtc.c===================================================================--- linux.orig/arch/x86/kernel/rtc.c+++ linux/arch/x86/kernel/rtc.c@@ -8,6 +8,7 @@  #include <linux/export.h> #include <linux/pnp.h> #include <linux/of.h>+#include <linux/pm-trace.h>  #include <asm/vsyscall.h> #include <asm/x86_init.h>@@ -146,6 +147,10 @@  void read_persistent_clock(struct timesp x86_platform.get_wallclock(ts); } +int arch_pm_trace_taint_pclock(void)+{+return (x86_platform.get_wallclock == mach_get_cmos_time);+}  static struct resource rtc_resources[] = { [0] = {Index: linux/kernel/time/timekeeping.c===================================================================--- linux.orig/kernel/time/timekeeping.c+++ linux/kernel/time/timekeeping.c@@ -23,6 +23,7 @@  #include <linux/stop_machine.h> #include <linux/pvclock_gtod.h> #include <linux/compiler.h>+#include <linux/pm-trace.h>  #include "tick-internal.h" #include "ntp_internal.h"@@ -1554,7 +1555,7 @@  static void __timekeeping_inject_sleepti  */ bool timekeeping_rtc_skipresume(void) {-return sleeptime_injected;+return sleeptime_injected || pm_trace_rtc_is_tainted(); }  /**@@ -1664,7 +1665,7 @@  void timekeeping_resume(void) sleeptime_injected = true; } else if (timespec64_compare(&ts_new, &timekeeping_suspend_time) > 0) { ts_delta = timespec64_sub(ts_new, timekeeping_suspend_time);-sleeptime_injected = true;+sleeptime_injected = pm_trace_pclock_is_tainted() ? false : true; }  if (sleeptime_injected)
Thus there is still x86 involved.


OK,

The pain point is , x86 may use rtc-cmos as its persistent clock source, but Thomas thinks

it is ugly to use something like

x86_platform.get_wallclock == mach_get_cmos_time

OK,  this reveal the fact that, we'd better not change the logic , but only by returning a result

to indicate the prefered result, this is a good design, so the magic is,

we add checking in mach_get_cmos_time directly, if pm_trace is enabled, return a error.

But the terrible thing is, mach_get_cmos_time does not have return value at all! Jesus,

how to deal with it?  Take it easy man, we can assign zero to mach_get_cmos_time thus

we read a zero of persistent clock value, which will not be used to be added to the sleep time.

With the same idea, we do some hack  in the 'leaf' function, for the situation of rtc-cmos used

in rtc_resume, thus we modified the return value to error in rtc_read, thus in cmos_read,

then rtc_resume will ignore the rtc returned from rtc_cmos thus does not add sleep time for it.


Let's see what Thomas do:

n Fri, 9 Sep 2016, Chen Yu wrote:> > I really have no idea why this is burried in x86 land. The pm_trace hackery> > issues mc146818_set_time() to fiddle with the RTC. So any implementation of> > this is affected.> OK, I've changed this patch according to this suggestion.No you did not! You still have that silly x86 hackery in place. > OK. I've moved most of the logic into the pm_trace component,> Once the mc146818_set_time has modified the RTC by pm_trace,> related flags will be set which indicates the unusable of RTC.> And timekeeping system is able to query these flags to decide whether> it should inject the sleep time. (We tried to make this patch as> simple as possible, but it looks like we have to deal with persistent> clock for x86, which makes this patch a little more complicated).> Here's the trial version of it, any suggestion would be appreciated:It's overengineered and horrible. > +unsigned int timekeeping_tainted;It does not taint timekeeping. It just wreckages the RTC.> +void pm_trace_taint_timekeeping(void)> +{> +if (pm_trace_is_enabled()) {> +timekeeping_tainted |= TIMEKEEPING_RTC_TAINTED;> +if (arch_pm_trace_taint_pclock())> +timekeeping_tainted |= TIMEKEEPING_PERSISTENT_CLOCK_TAINTED;> +}Why would you need all these flags?> +static inline int pm_trace_rtc_is_tainted(void)> +{> +return (timekeeping_tainted & TIMEKEEPING_RTC_TAINTED) ?> +1 : 0;ever heard about bool?> +}> +> +extern void pm_trace_untaint_timekeeping(void);And how exactly do you untaint it? Just by clearing the flags. That makesthe RTC time magically correct again?> +int arch_pm_trace_taint_pclock(void)> +{> +return (x86_platform.get_wallclock == mach_get_cmos_time);> +}Groan. I told you to do it in the mc14xxx related places. There are notthat many in the kernelHere is a completely uncompiled/untested patch which should address theissue in a halfways clean way.

--- a/arch/x86/kernel/rtc.c+++ b/arch/x86/kernel/rtc.c@@ -64,6 +64,15 @@  void mach_get_cmos_time(struct timespec unsigned int status, year, mon, day, hour, min, sec, century = 0; unsigned long flags; +/*+ * If pm trace abused the RTC as storage set the timespec to 0+ * which tells the caller that this RTC value is bogus.+ */+if (!pm_trace_rtc_valid()) {+now->tv_sec = now->tv_nsec = 0;+return;+}+ spin_lock_irqsave(&rtc_lock, flags);  /*--- a/drivers/base/power/trace.c+++ b/drivers/base/power/trace.c@@ -74,6 +74,7 @@   #define DEVSEED (7919) +bool pm_trace_rtc_abused __read_mostly; static unsigned int dev_hash_value;  static int set_magic_time(unsigned int user, unsigned int file, unsigned int device)@@ -104,6 +105,7 @@  static int set_magic_time(unsigned int u time.tm_min = (n % 20) * 3; n /= 20; mc146818_set_time(&time);+pm_trace_rtc_abused = true; return n ? -1 : 0; } --- a/drivers/rtc/rtc-cmos.c+++ b/drivers/rtc/rtc-cmos.c@@ -189,6 +189,13 @@  static inline void cmos_write_bank2(unsi  static int cmos_read_time(struct device *dev, struct rtc_time *t) {+/*+ * If pmtrace abused the RTC for storage tell the caller that it is+ * unusable.+ */+if (!pm_trace_rtc_valid())+return -EIO;+ /* REVISIT:  if the clock has a "century" register, use  * that instead of the heuristic in mc146818_get_time().  * That'll make Y3K compatility (year > 2070) easy!--- a/include/linux/mc146818rtc.h+++ b/include/linux/mc146818rtc.h@@ -16,6 +16,7 @@  #include <asm/mc146818rtc.h>/* register access macros */ #include <linux/bcd.h> #include <linux/delay.h>+#include <linux/pm-trace.h>  #ifdef __KERNEL__ #include <linux/spinlock.h>/* spinlock_t */--- a/include/linux/pm-trace.h+++ b/include/linux/pm-trace.h@@ -6,6 +6,12 @@  #include <linux/types.h>  extern int pm_trace_enabled;+extern bool pm_trace_rtc_abused;++static inline bool pm_trace_rtc_valid(void)+{+return !pm_trace_rtc_abused;+}  static inline int pm_trace_is_enabled(void) {@@ -24,6 +30,7 @@  extern int show_trace_dev_match(char *bu  #else +static inline bool pm_trace_rtc_valid(void) { return true; } static inline int pm_trace_is_enabled(void) { return 0; }  #define TRACE_DEVICE(dev) do { } while (0)


0 0