synchronization---per-CPU variable

====================================================================================


Index:


1. Intro


2. Reference


3. Basic theory


4. APIs
    4.1. APIs for static per-CPU variable
    4.2. APIs for dynamic per-CPU variable


5. Implementation details, based on kernel 2.6.11.12
    5.1. impl of static per-CPU variable
        5.1.1. UP version
        5.1.2. SMP version
    5.2. impl of dynamic per-CPU variable
        5.2.1. UP version
        5.2.2. SMP version


6. Misc tips




====================================================================================


1. Intro


This doc describes per-CPU variables.




====================================================================================


2. Reference


    [1] <<ulk>> - O'Reilly, Understanding the Linux Kernel, 3rd Edition
        // 5.2.1. Per-CPU Variables




====================================================================================


3. Basic theory


<<ulk>>
    /5.2. Synchronization Primitives
        Table 5-2. Various types of synchronization techniques used by the kernel


        Technique               Description                                         Scope
 
        Per-CPU variables       Duplicate a data structure among the CPUs           All CPUs




The basic theory of per-CPU variables is:


    For per-CPU variables, the kernel arranges them like below:


         *  variable #0                 variable #1                variable #2
         *  -------------------          -------------------        -------------------
         * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u2 | u3 |
         *  -------------------  ......  -------------------  ....  -------------------




    A per-CPU variable is in fact an array-like structure: it has NR_CPUS elements, and each element corresponds to one CPU.
    [*] <<ulk>> says each element is aligned to a CPU cache line; that is an implementation detail, as we will see.


    Then, the code only accesses the local CPU's copy of the per-CPU variable.
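
    Conceptually, a per-CPU variable can be pictured as an array indexed by the local CPU id. A minimal sketch of the model (illustration only, NOT the real kernel layout or API; pkt_stats is a made-up type):

        /* Conceptual model only: one slot per CPU, indexed by the local CPU id. */
        struct pkt_stats {
                unsigned long rx_packets;
        };

        static struct pkt_stats stats[NR_CPUS];         /* "per-CPU" array */

        static void count_rx(void)
        {
                /* each CPU reads/writes only its own slot */
                stats[smp_processor_id()].rx_packets++;
        }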




Per-CPU variables are divided into 2 types:


    static per-CPU variable


        like a simple static variable, it is compiled and linked directly into vmlinux or a module.




    dynamic per-CPU variable


        like a simple dynamic variable, it is allocated at runtime from a dynamic memory area.






As a synchronization technique, a per-CPU variable alone is not that reliable; consider the following scenario:


        task #0's system call service routine is accessing the local copy of a per-CPU variable on CPU #0.


            a HW IRQ is issued; the hardirq handler interrupts the system call service routine and runs.
            This hardirq handler wakes up a higher-priority task #1.


            the hardirq handler returns.
            During IRET, kernel preemption happens: task #1 preempts task #0.


        task #1 gets to run.


        .....


        After some time, task #0 is migrated to another CPU, CPU #1, and gets scheduled and resumed there.


        !!!__but now, task #0 is still accessing CPU #0's copy of the per-CPU variable, not CPU #1's. This causes problems.




So per-CPU variables MUST be used together with other synchronization techniques. When accessing per-CPU variables, we need to:


        disable preemption


            This prevents the scenario above, so task #0 stays on CPU #0 for the whole duration of its access to the per-CPU variable.




        disable softirq             # including _lock_bh
        disable hardirq             # including _lock_irq / _lock_irqsave


            These implicitly disable preemption.

            Additionally, they are needed whenever a softirq / hardirq handler can possibly access the same per-CPU variable. In fact, these 2 are the usual rules for using locks. See:
                <<kdoc - kernel-locking>>




[*] Note that, in the scenario above, we used a system call service routine as an example, but this DOES NOT mean per-CPU variables are only used in "user context"; they can be used in ANY context.
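
Putting the rules above together, a minimal sketch (assuming 2.6.11-era APIs and header locations; my_counter / my_state are hypothetical per-CPU variables):

    #include <linux/percpu.h>
    #include <linux/interrupt.h>        /* local_bh_disable() / local_bh_enable() */

    static DEFINE_PER_CPU(unsigned long, my_counter);
    static DEFINE_PER_CPU(int, my_state);

    static void touch_percpu_safely(void)
    {
            unsigned long flags;

            /* Rule 1: plain access - just keep preemption disabled. */
            get_cpu_var(my_counter)++;
            put_cpu_var(my_counter);

            /* Rule 2: also shared with a softirq handler on this CPU. */
            local_bh_disable();
            __get_cpu_var(my_state) = 1;
            local_bh_enable();

            /* Rule 3: also shared with a hardirq handler on this CPU. */
            local_irq_save(flags);
            __get_cpu_var(my_state) = 2;
            local_irq_restore(flags);
    }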




====================================================================================


4. APIs


We use different sets of APIs to manipulate static per-CPU variables and dynamic per-CPU variables.


To use per-CPU variables, just:
    #include <linux/percpu.h>


Don't include the other percpu.h headers, which contain the architecture-specific implementation details of per-CPU variables.




[*] Note that the following APIs are from kernel 2.6.11.12; the APIs of recent kernels stay the same, but the implementation has changed a lot. For simplicity, we use kernel 2.6.11.12 for the description.




====================================================================================


4.1. APIs for static per-CPU variable


#
#       DECLARE_PER_CPU()
#   is to declare a static per-CPU variable with external linkage: its expansion uses the 'extern' keyword.
#
#   It is usually used for declaring a per-CPU variable in a header file, or as a forward declaration in a C file.
#
#
#       DEFINE_PER_CPU()
#   is to define a static per-CPU variable.
#
#   It is used in a C file.
#


#define DECLARE_PER_CPU(type, name)
#define DEFINE_PER_CPU(type, name)




#
#       per_cpu(var, cpu)
# Selects the element for CPU cpu of the per-CPU array var.
#
# Note that per_cpu() does NOT retrieve the local CPU's copy of the per-CPU variable; it retrieves the copy
# of the specified CPU(!^^__by CPU index).
#
# It should be considered an internal API, and it is rarely used directly in common kernel programming; one
# legitimate use, reading every CPU's copy, is sketched after the macro below.
#


#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
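
One legitimate use of per_cpu() is reading every CPU's copy, e.g. to sum a statistics counter. A sketch (my_counter is a hypothetical static per-CPU variable; the unlocked cross-CPU reads are only approximate, which is usually acceptable for statistics):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    static unsigned long sum_my_counter(void)
    {
            unsigned long sum = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                    if (!cpu_possible(cpu))
                            continue;
                    sum += per_cpu(my_counter, cpu);        /* read CPU cpu's copy */
            }
            return sum;
    }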




#
#       __get_cpu_var(var)
# Gets the local CPU's copy of the per-CPU variable(!^^__that is, smp_processor_id() returns the local CPU
# index, which is then passed to per_cpu()).
#


#define __get_cpu_var(var) per_cpu(var, smp_processor_id())




#
#       get_cpu_var(var)
# Disables kernel preemption, then selects the local CPU's element of the per-CPU array var
#
#       put_cpu_var(var)
# Enables kernel preemption (var is not used)
#

# As we can see, get_cpu_var() / put_cpu_var() internally disable / enable kernel preemption, so they are the
# most commonly used APIs for static per-CPU variables; typical usage is sketched after the macros below.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()
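
Typical usage of the pair, a minimal sketch (my_counter is a hypothetical static per-CPU variable):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    static void bump_my_counter(void)
    {
            /* Preemption is disabled between the two calls, so we cannot be
             * migrated to another CPU while touching the local copy. */
            get_cpu_var(my_counter)++;
            put_cpu_var(my_counter);
    }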




===================================================================================


4.2. APIs for dynamic per-CPU variable


#
#       alloc_percpu(type)
# Dynamically allocates a per-CPU array of type data structures and returns its address
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
#       free_percpu(pointer)
# Releases a dynamically allocated per-CPU array at address pointer
#


static inline void free_percpu(const void *ptr)




#
#       per_cpu_ptr(pointer, cpu)
# Returns the address of the element for CPU cpu of the per-CPU array at address pointer
#
# Note that, unlike get_cpu_var() / put_cpu_var(), per_cpu_ptr() does not disable kernel preemption for us;
# in that respect it is like __get_cpu_var(). So when we use per_cpu_ptr(), we need to disable preemption
# ourselves, as in the sketch after the macro below.
#


#define per_cpu_ptr(ptr, cpu)                   \
    ({                                              \
            struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
            (__typeof__(ptr))__p->ptrs[(cpu)];  \
    })
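
Typical usage of the dynamic APIs, a sketch with hypothetical names (get_cpu() / put_cpu() are the standard preempt_disable()-plus-smp_processor_id() / preempt_enable() helpers from <linux/smp.h>):

    struct my_stats {
            unsigned long events;
    };

    static struct my_stats *stats;      /* opaque per-CPU handle from alloc_percpu() */

    static int my_stats_init(void)
    {
            int cpu;

            stats = alloc_percpu(struct my_stats);
            if (!stats)
                    return -ENOMEM;

            cpu = get_cpu();                    /* disables preemption, returns local CPU id */
            per_cpu_ptr(stats, cpu)->events++;  /* local CPU's element */
            put_cpu();                          /* re-enables preemption */

            return 0;
    }

    static void my_stats_exit(void)
    {
            free_percpu(stats);
    }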




====================================================================================


5. Implementation details, based on kernel 2.6.11.12


Although the theory / semantics / APIs of per-CPU variables remain the same across kernel versions, in recent kernels the internal implementation has changed a lot and become much more complicated(!^^__the same thing happened to workQ...).


For simplicity, here we describe the implementation based on kernel 2.6.11.12, which is enough to understand the internals of per-CPU variables.




====================================================================================


5.1. impl of static per-CPU variable




====================================================================================


5.1.1. UP version


#
# [*] In fact, when we use "name" to define a static per-CPU variable, the variable's real name is not
# directly "name", but "name" prepended with the prefix "per_cpu__". This handling is common to UP / SMP.
#
# In the UP version, a per-CPU variable is defined just like a regular variable, with no special handling:
# there is only one CPU, so there is only one element, and no need to define the per-CPU variable as an array.
#


#define DEFINE_PER_CPU(type, name) \
    __typeof__(type) per_cpu__##name
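
For example (an expansion sketch), DEFINE_PER_CPU(int, my_counter) in UP expands to nothing more than:

    int per_cpu__my_counter;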




#
# So, in UP, per_cpu() and __get_cpu_var() just return the variable directly.
#


#define per_cpu(var, cpu)           (*((void)cpu, &per_cpu__##var))
#define __get_cpu_var(var)          per_cpu__##var




#
# get_cpu_var() / put_cpu_var() are common to UP / SMP; it is the internal __get_cpu_var() they call that
# makes the difference.
#
# Note that, even in UP, get_cpu_var() also disables kernel preemption, because it needs to avoid the
# following case:
#       task #0 is preempted by task #1 in the middle of updating a per-CPU variable.
#       task #1 accesses the same per-CPU variable, and sees an inconsistent (half-updated) view of it.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()




====================================================================================


5.1.2. SMP version


#
# In SMP, DEFINE_PER_CPU() performs some special handling when defining a per-CPU variable.
#
# It uses the section attribute ".data.percpu", so the per-CPU variable will be compiled and linked into
# the ".data.percpu" section of vmlinux or of the module.
#
# [*] Note that, even for SMP, a per-CPU variable is NOT directly defined as "an array of NR_CPUS elements";
# we will see how the kernel handles this soon.
#


#define DEFINE_PER_CPU(type, name) \
    __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
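
For example (an expansion sketch), DEFINE_PER_CPU(int, my_counter) in SMP expands to a single variable placed into the ".data.percpu" section:

    __attribute__((__section__(".data.percpu"))) int per_cpu__my_counter;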






#
# get_cpu_var() / put_cpu_var() are common to UP / SMP; it is the internal __get_cpu_var() they call that
# makes the difference.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()




#
# The SMP per_cpu() is different from the UP one; it computes:
#
#       &"per_cpu__##var" + __per_cpu_offset[cpu]
#


#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))


    # define RELOC_HIDE(ptr, off)                   \
          ({ unsigned long __ptr;                   \
             __ptr = (unsigned long) (ptr);             \
            (typeof(ptr)) (__ptr + (off)); })
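
    #
    # [*] RELOC_HIDE() does the addition through an unsigned long on purpose: it hides the pointer
    # arithmetic from GCC, so the compiler cannot assume the result still points into the original
    # per_cpu__##var object and optimize based on that assumption.
    #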




-----------------------------------------------------------------------------------


#
# setup_per_cpu_areas() sets up the memory area containing the static per-CPU variables of vmlinux.
#
@@trace - how the kernel handles the static per-CPU variables of vmlinux.


start_kernel()


    setup_per_cpu_areas();


        #
        # __per_cpu_start[] and __per_cpu_end[] are 2 linker symbols, defined in:
        #       /arch/$(arch)/kernel/vmlinux.lds.S  -   x86, mips, ppc
        # like:
        #         __per_cpu_start = .;
        #         .data.percpu  : { *(.data.percpu) }
        #         __per_cpu_end = .;
        #         . = ALIGN(4096);
        #
        # As we can see, they mark the start and the end of the ".data.percpu" section, which contains all
        # the static per-CPU variables of vmlinux.
        #


        /* Created by linker magic */
        extern char __per_cpu_start[], __per_cpu_end[];


        #
        # Compute the size of the ".data.percpu" section.
        #
        # Allocate a memory area of "size of .data.percpu" x NR_CPUS from the bootmem allocator.
        #
        # Copy the contents of the ".data.percpu" section into this memory area, duplicating it NR_CPUS
        # times, and set __per_cpu_offset[] accordingly.
        #


        /* Copy section for each CPU (we discard the original) */
        size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
        #ifdef CONFIG_MODULES
            if (size < PERCPU_ENOUGH_ROOM)
                    size = PERCPU_ENOUGH_ROOM;
        #endif


        ptr = alloc_bootmem(size * NR_CPUS);


        for (i = 0; i < NR_CPUS; i++, ptr += size) {
            __per_cpu_offset[i] = ptr - __per_cpu_start;
            memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
        }
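
        #
        # A worked example with made-up numbers: suppose __per_cpu_start == 0xc0400000, the aligned
        # per-CPU size is 0x2000, and alloc_bootmem() returns 0xc1000000. Then:
        #
        #       __per_cpu_offset[0] = 0xc1000000 - 0xc0400000 = 0x00c00000
        #       __per_cpu_offset[1] = 0xc1002000 - 0xc0400000 = 0x00c02000
        #
        # For a variable whose link-time address &per_cpu__foo is 0xc0400010 (offset 0x10 into
        # ".data.percpu"), per_cpu(foo, 1) then accesses:
        #
        #       0xc0400010 + 0x00c02000 = 0xc1002010
        #
        # i.e. offset 0x10 into range #1 of the new memory area.
        #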




-----------------------------------------------------------------------------------


As we can see from the above, the memory area allocated by setup_per_cpu_areas() in fact has the following layout:


        ------------------------------      <-  __per_cpu_offset[0]


            range #0 for CPU #0


        ------------------------------
            SMP_CACHE_BYTES alignment
        ------------------------------      <-  __per_cpu_offset[1]


            range #1 for CPU #1


        ------------------------------
                    .
                    .
                    .
        ------------------------------
            SMP_CACHE_BYTES alignment
        ------------------------------      <-  __per_cpu_offset[N]


            range #N for CPU #N


        ------------------------------


In theory a per-CPU variable is an "array of elements", but in the implementation the elements are NOT laid out contiguously in RAM; each CPU's elements live in that CPU's "range", like the following:


        -------------------------------------------         <-  __per_cpu_offset[0]


            range #0 for CPU #0


                -----------------------------------
                CPU #0 copy of per-CPU variable #a
                -----------------------------------
                CPU #0 copy of per-CPU variable #b
                -----------------------------------
                CPU #0 copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------
            SMP_CACHE_BYTES alignment
        -------------------------------------------         <-  __per_cpu_offset[1]


            range #1 for CPU #1


                -----------------------------------
                CPU #1 copy of per-CPU variable #a
                -----------------------------------
                CPU #1 copy of per-CPU variable #b
                -----------------------------------
                CPU #1 copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------
                    .
                    .
                    .
        -------------------------------------------
            SMP_CACHE_BYTES alignment
        -------------------------------------------         <-  __per_cpu_offset[N]


            range #N for CPU #N


                -----------------------------------
                CPU #N copy of per-CPU variable #a
                -----------------------------------
                CPU #N copy of per-CPU variable #b
                -----------------------------------
                CPU #N copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------




In fact, the ".data.percpu" section of vmlinux is released after the memory area above has been constructed (!^^__perhaps when the bootmem allocator retires). The SMP per_cpu() then returns a pointer into the per-CPU range of that memory area, by computing:


    #
    #       &"per_cpu__##var" + __per_cpu_offset[cpu]
    #
    # [*] Note that "per_cpu__##var" is the original address value(!^^__known at compile/link time) of the
    # per-CPU variable in the ".data.percpu" section, which is discarded. We never access this address
    # directly; we just add __per_cpu_offset[cpu] to it, to get the actual address of the specified CPU's
    # copy inside the per-CPU range of the memory area.
    #
    # [*] So, this is why we don't define "an array of elements" in the SMP DEFINE_PER_CPU(), and how
    # per_cpu() works.
    #


    #define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))


        # define RELOC_HIDE(ptr, off)                   \
              ({ unsigned long __ptr;                   \
                 __ptr = (unsigned long) (ptr);             \
                (typeof(ptr)) (__ptr + (off)); })




-----------------------------------------------------------------------------------


[*] How are static per-CPU variables in modules handled ???


    Because we use the same APIs to access static per-CPU variables in the kernel and in modules, module per-CPU variables are also organized into the per-CPU memory ranges described by __per_cpu_offset[NR_CPUS].


    As we see from:


        setup_per_cpu_areas();
    
            /* Copy section for each CPU (we discard the original) */
            size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
            #ifdef CONFIG_MODULES
                if (size < PERCPU_ENOUGH_ROOM)
                        size = PERCPU_ENOUGH_ROOM;
            #endif      




        /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
        #ifndef PERCPU_ENOUGH_ROOM
        #define PERCPU_ENOUGH_ROOM 32768
        #endif




    So, besides the per-CPU variables of vmlinux, the memory area also has room for the per-CPU variables of modules.


    The kernel duplicates NR_CPUS copies of a module's per-CPU variables into that memory area at module load time.


    As for details, see:


        /kernel/module.c    -   percpu_modinit() and so on      # well, not enough energy to investigate.




====================================================================================


5.2. impl of dynamic per-CPU variable


====================================================================================


5.2.1. UP version


#
# The API alloc_percpu() is common to both UP / SMP; it is __alloc_percpu() that makes the difference.
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
# The UP __alloc_percpu() just calls kmalloc(size, GFP_KERNEL), allocating (and zeroing) the single element.
#


static inline void *__alloc_percpu(size_t size, size_t align)
{
    void *ret = kmalloc(size, GFP_KERNEL);
    if (ret)
        memset(ret, 0, size);
    return ret;
}




#
# Correspondingly, the UP free_percpu() is also simple: it just frees the single element.
#


static inline void free_percpu(const void *ptr)
{   
    kfree(ptr);
}




#
# The UP per_cpu_ptr() simply returns the pointer to the single element.
#


#define per_cpu_ptr(ptr, cpu) (ptr)




====================================================================================


5.2.2. SMP version


#
# The API alloc_percpu() is common to both UP / SMP; it is __alloc_percpu() that makes the difference.
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
# The SMP __alloc_percpu() also does not allocate an "array of NR_CPUS elements" for the dynamic per-CPU
# variable.
#
# Instead, it:
#       Allocates a "percpu_data" instance, which is:
#
#           struct percpu_data {        # represents a dynamic per-CPU variable, but is internal.
#               void *ptrs[NR_CPUS];
#           };
#
#       Then allocates each per-CPU element with the NUMA-aware kmem_cache_alloc_node(), saving each
#       element into "percpu_data->ptrs[]".
#
#       Returns an obfuscated (bitwise-inverted) value of the "percpu_data *".
#
# So, just like the static ones, the per-CPU elements of an SMP dynamic per-CPU variable are not
# contiguous in RAM.
#


void *__alloc_percpu(size_t size, size_t align)


    struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);


    for (i = 0; i < NR_CPUS; i++) {
        if (!cpu_possible(i))
            continue;
        pdata->ptrs[i] = kmem_cache_alloc_node(
                kmem_find_general_cachep(size, GFP_KERNEL),
                cpu_to_node(i));

        memset(pdata->ptrs[i], 0, size);
    }


    #
    # Note that we don't simply return the address of the "percpu_data", but an obfuscated
    # (bitwise-inverted) value.
    #
    return (void *) (~(unsigned long) pdata);
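
    #
    # [*] Bitwise NOT is its own inverse (~~x == x), so per_cpu_ptr() and free_percpu() can recover
    # the real "percpu_data *" from the value handed out here:
    #
    #       struct percpu_data *p = (struct percpu_data *) (~(unsigned long) ptr);
    #
    # The inversion makes the returned handle useless as a plain pointer, forcing callers to go
    # through per_cpu_ptr() instead of dereferencing it directly.
    #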






#
# Correspondingly, the SMP per_cpu_ptr() returns "percpu_data->ptrs[cpu]", after inverting the pointer
# bits back.
#


#define per_cpu_ptr(ptr, cpu)                   \
    ({                                              \
            struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
            (__typeof__(ptr))__p->ptrs[(cpu)];  \
    })




#
# And the SMP free_percpu() frees the per-CPU elements in "percpu_data->ptrs[]", and then the
# "percpu_data" instance itself.
#


void free_percpu(const void *objp)


    #
    # Invert the bits back, to recover the actual address of the "percpu_data".
    #
    struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);


    for (i = 0; i < NR_CPUS; i++) {
        if (!cpu_possible(i))
            continue;
        kfree(p->ptrs[i]);
    }


    kfree(p);




====================================================================================


6. Misc tips


NONE.