synchronization---per-CPU variable

====================================================================================


Index:


1. Intro


2. Reference


3. Basic theory


4. APIs
    4.1. APIs for static per-CPU variable
    4.2. APIs for dynamic per-CPU variable


5. Implementation details, based on kernel 2.6.11.12
    5.1. impl of static per-CPU variable
        5.1.1. UP version
        5.1.2. SMP version
    5.2. impl of dynamic per-CPU variable
        5.2.1. UP version
        5.2.2. SMP version


6. Misc tips




====================================================================================


1. Intro


This doc describes per-CPU variables.




====================================================================================


2. Reference


    [1] <<ulk>> - O'Reilly, Understanding the Linux Kernel, 3rd Edition
        // 5.2.1. Per-CPU Variables




====================================================================================


3. Basic theory


<<ulk>>
    /5.2. Synchronization Primitives
        Table 5-2. Various types of synchronization techniques used by the kernel


        Technique               Description                                         Scope
 
        Per-CPU variables       Duplicate a data structure among the CPUs           All CPUs




The basic theory of per-CPU variables is:


    For per-CPU variables, the kernel arranges them like below:


         *  variable #0                 variable #1                variable #2
         *  -------------------          -------------------        -------------------
         * | u0 | u1 | u2 | u3 |        | u0 | u1 | u2 | u3 |      | u0 | u1 | u2 | u3 |
         *  -------------------  ......  -------------------  ....  -------------------




    A per-CPU variable is in fact an array-like structure: it has NR_CPUS elements, and each element corresponds to one CPU.
    [*] <<ulk>> says each element is aligned to a CPU cache line; that is an implementation detail, as we will see.


    Then, the code only accesses the local CPU's copy of the per-CPU variable.
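
    Conceptually, a per-CPU variable can be pictured as an array indexed by the local CPU id. A minimal sketch of the model (illustration only, NOT the real kernel layout or API; pkt_stats is a made-up type):

        /* Conceptual model only: one slot per CPU, indexed by the local CPU id. */
        struct pkt_stats {
                unsigned long rx_packets;
        };

        static struct pkt_stats stats[NR_CPUS];         /* "per-CPU" array */

        static void count_rx(void)
        {
                /* each CPU reads/writes only its own slot */
                stats[smp_processor_id()].rx_packets++;
        }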




Per-CPU variables are divided into 2 types:


    static per-CPU variable


        like a simple static variable, it is compiled and linked directly into vmlinux or a module.




    dynamic per-CPU variable


        like a simple dynamic variable, it is allocated at runtime from a dynamic memory area.






As a synchronization technique, a per-CPU variable alone is not that reliable; consider the following scenario:


        task #0's system call service routine is accessing the local copy of a per-CPU variable on CPU #0.


            a HW IRQ is issued; the hardirq handler interrupts the system call service routine and runs.
            This hardirq handler wakes up a higher-priority task #1.


            the hardirq handler returns.
            During IRET, kernel preemption happens: task #1 preempts task #0.


        task #1 gets to run.


        .....


        After some time, task #0 is migrated to another CPU, CPU #1, and gets scheduled and resumed there.


        !!!__but now, task #0 is still accessing CPU #0's copy of the per-CPU variable, not CPU #1's. This causes problems.




So per-CPU variables MUST be used together with other synchronization techniques. When accessing per-CPU variables, we need to:


        disable preemption


            This prevents the scenario above, so task #0 stays on CPU #0 for the whole duration of its access to the per-CPU variable.




        disable softirq             # including _lock_bh
        disable hardirq             # including _lock_irq / _lock_irqsave


            These implicitly disable preemption.

            Additionally, they are needed whenever a softirq / hardirq handler can possibly access the same per-CPU variable. In fact, these 2 are the usual rules for using locks. See:
                <<kdoc - kernel-locking>>




[*] Note that, in the scenario above, we used a system call service routine as an example, but this DOES NOT mean per-CPU variables are only used in "user context"; they can be used in ANY context.
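
Putting the rules above together, a minimal sketch (assuming 2.6.11-era APIs and header locations; my_counter / my_state are hypothetical per-CPU variables):

    #include <linux/percpu.h>
    #include <linux/interrupt.h>        /* local_bh_disable() / local_bh_enable() */

    static DEFINE_PER_CPU(unsigned long, my_counter);
    static DEFINE_PER_CPU(int, my_state);

    static void touch_percpu_safely(void)
    {
            unsigned long flags;

            /* Rule 1: plain access - just keep preemption disabled. */
            get_cpu_var(my_counter)++;
            put_cpu_var(my_counter);

            /* Rule 2: also shared with a softirq handler on this CPU. */
            local_bh_disable();
            __get_cpu_var(my_state) = 1;
            local_bh_enable();

            /* Rule 3: also shared with a hardirq handler on this CPU. */
            local_irq_save(flags);
            __get_cpu_var(my_state) = 2;
            local_irq_restore(flags);
    }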




====================================================================================


4. APIs


We use different sets of APIs to manipulate static per-CPU variables and dynamic per-CPU variables.


To use per-CPU variables, just:
    #include <linux/percpu.h>


Don't include the other percpu.h headers, which contain the architecture-specific implementation details of per-CPU variables.




[*] Note that the following APIs are from kernel 2.6.11.12; the APIs of recent kernels stay the same, but the implementation has changed a lot. For simplicity, we use kernel 2.6.11.12 for the description.




====================================================================================


4.1. APIs for static per-CPU variable


#
#       DECLARE_PER_CPU()
#   is to declare a static per-CPU variable with external linkage: its expansion uses the 'extern' keyword.
#
#   It is usually used for declaring a per-CPU variable in a header file, or as a forward declaration in a C file.
#
#
#       DEFINE_PER_CPU()
#   is to define a static per-CPU variable.
#
#   It is used in a C file.
#


#define DECLARE_PER_CPU(type, name)
#define DEFINE_PER_CPU(type, name)




#
#       per_cpu(var, cpu)
# Selects the element for CPU cpu of the per-CPU array var.
#
# Note that per_cpu() does NOT retrieve the local CPU's copy of the per-CPU variable; it retrieves the copy
# of the specified CPU(!^^__by CPU index).
#
# It should be considered an internal API, and it is rarely used directly in common kernel programming; one
# legitimate use, reading every CPU's copy, is sketched after the macro below.
#


#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
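
One legitimate use of per_cpu() is reading every CPU's copy, e.g. to sum a statistics counter. A sketch (my_counter is a hypothetical static per-CPU variable; the unlocked cross-CPU reads are only approximate, which is usually acceptable for statistics):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    static unsigned long sum_my_counter(void)
    {
            unsigned long sum = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                    if (!cpu_possible(cpu))
                            continue;
                    sum += per_cpu(my_counter, cpu);        /* read CPU cpu's copy */
            }
            return sum;
    }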




#
#       __get_cpu_var(var)
# Gets the local CPU's copy of the per-CPU variable(!^^__that is, smp_processor_id() returns the local CPU
# index, which is then passed to per_cpu()).
#


#define __get_cpu_var(var) per_cpu(var, smp_processor_id())




#
#       get_cpu_var(var)
# Disables kernel preemption, then selects the local CPU's element of the per-CPU array var
#
#       put_cpu_var(var)
# Enables kernel preemption (var is not used)
#

# As we can see, get_cpu_var() / put_cpu_var() internally disable / enable kernel preemption, so they are the
# most commonly used APIs for static per-CPU variables; typical usage is sketched after the macros below.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()
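
Typical usage of the pair, a minimal sketch (my_counter is a hypothetical static per-CPU variable):

    static DEFINE_PER_CPU(unsigned long, my_counter);

    static void bump_my_counter(void)
    {
            /* Preemption is disabled between the two calls, so we cannot be
             * migrated to another CPU while touching the local copy. */
            get_cpu_var(my_counter)++;
            put_cpu_var(my_counter);
    }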




===================================================================================


4.2. APIs for dynamic per-CPU variable


#
#       alloc_percpu(type)
# Dynamically allocates a per-CPU array of type data structures and returns its address
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
#       free_percpu(pointer)
# Releases a dynamically allocated per-CPU array at address pointer
#


static inline void free_percpu(const void *ptr)




#
#       per_cpu_ptr(pointer, cpu)
# Returns the address of the element for CPU cpu of the per-CPU array at address pointer
#
# Note that, unlike get_cpu_var() / put_cpu_var(), per_cpu_ptr() does not disable kernel preemption for us;
# in that respect it is like __get_cpu_var(). So when we use per_cpu_ptr(), we need to disable preemption
# ourselves, as in the sketch after the macro below.
#


#define per_cpu_ptr(ptr, cpu)                   \
    ({                                              \
            struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
            (__typeof__(ptr))__p->ptrs[(cpu)];  \
    })
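
Typical usage of the dynamic APIs, a sketch with hypothetical names (get_cpu() / put_cpu() are the standard preempt_disable()-plus-smp_processor_id() / preempt_enable() helpers from <linux/smp.h>):

    struct my_stats {
            unsigned long events;
    };

    static struct my_stats *stats;      /* opaque per-CPU handle from alloc_percpu() */

    static int my_stats_init(void)
    {
            int cpu;

            stats = alloc_percpu(struct my_stats);
            if (!stats)
                    return -ENOMEM;

            cpu = get_cpu();                    /* disables preemption, returns local CPU id */
            per_cpu_ptr(stats, cpu)->events++;  /* local CPU's element */
            put_cpu();                          /* re-enables preemption */

            return 0;
    }

    static void my_stats_exit(void)
    {
            free_percpu(stats);
    }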




====================================================================================


5. Implementation details, based on kernel 2.6.11.12


Although the theory / semantics / APIs of per-CPU variables remain the same across kernel versions, in recent kernels the internal implementation has changed a lot and become much more complicated(!^^__the same thing happened to workQ...).


For simplicity, here we describe the implementation based on kernel 2.6.11.12, which is enough to understand the internals of per-CPU variables.




====================================================================================


5.1. impl of static per-CPU variable




====================================================================================


5.1.1. UP version


#
# [*] In fact, when we use "name" to define a static per-CPU variable, the variable's real name is not
# directly "name", but "name" prepended with the prefix "per_cpu__". This handling is common to UP / SMP.
#
# In the UP version, a per-CPU variable is defined just like a regular variable, with no special handling:
# there is only one CPU, so there is only one element, and no need to define the per-CPU variable as an array.
#


#define DEFINE_PER_CPU(type, name) \
    __typeof__(type) per_cpu__##name
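
For example (an expansion sketch), DEFINE_PER_CPU(int, my_counter) in UP expands to nothing more than:

    int per_cpu__my_counter;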




#
# So, in UP, per_cpu() and __get_cpu_var() just return the variable directly.
#


#define per_cpu(var, cpu)           (*((void)cpu, &per_cpu__##var))
#define __get_cpu_var(var)          per_cpu__##var




#
# get_cpu_var() / put_cpu_var() are common to UP / SMP; it is the internal __get_cpu_var() they call that
# makes the difference.
#
# Note that, even in UP, get_cpu_var() also disables kernel preemption, because it needs to avoid the
# following case:
#       task #0 is preempted by task #1 in the middle of updating a per-CPU variable.
#       task #1 accesses the same per-CPU variable, and sees an inconsistent (half-updated) view of it.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()




====================================================================================


5.1.2. SMP version


#
# In SMP, DEFINE_PER_CPU() performs some special handling when defining a per-CPU variable.
#
# It uses the section attribute ".data.percpu", so the per-CPU variable will be compiled and linked into
# the ".data.percpu" section of vmlinux or of the module.
#
# [*] Note that, even for SMP, a per-CPU variable is NOT directly defined as "an array of NR_CPUS elements";
# we will see how the kernel handles this soon.
#


#define DEFINE_PER_CPU(type, name) \
    __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
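
For example (an expansion sketch), DEFINE_PER_CPU(int, my_counter) in SMP expands to a single variable placed into the ".data.percpu" section:

    __attribute__((__section__(".data.percpu"))) int per_cpu__my_counter;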






#
# get_cpu_var() / put_cpu_var() are common to UP / SMP; it is the internal __get_cpu_var() they call that
# makes the difference.
#


#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()




#
# The SMP per_cpu() is different from the UP one; it computes:
#
#       &"per_cpu__##var" + __per_cpu_offset[cpu]
#


#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))


    # define RELOC_HIDE(ptr, off)                   \
          ({ unsigned long __ptr;                   \
             __ptr = (unsigned long) (ptr);             \
            (typeof(ptr)) (__ptr + (off)); })
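
    #
    # [*] RELOC_HIDE() does the addition through an unsigned long on purpose: it hides the pointer
    # arithmetic from GCC, so the compiler cannot assume the result still points into the original
    # per_cpu__##var object and optimize based on that assumption.
    #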




-----------------------------------------------------------------------------------


#
# setup_per_cpu_areas() sets up the memory area containing the static per-CPU variables of vmlinux.
#
@@trace - how the kernel handles the static per-CPU variables of vmlinux.


start_kernel()


    setup_per_cpu_areas();


        #
        # __per_cpu_start[] and __per_cpu_end[] are 2 linker symbols, defined in:
        #       /arch/$(arch)/kernel/vmlinux.lds.S  -   x86, mips, ppc
        # like:
        #         __per_cpu_start = .;
        #         .data.percpu  : { *(.data.percpu) }
        #         __per_cpu_end = .;
        #         . = ALIGN(4096);
        #
        # As we can see, they mark the start and the end of the ".data.percpu" section, which contains all
        # the static per-CPU variables of vmlinux.
        #


        /* Created by linker magic */
        extern char __per_cpu_start[], __per_cpu_end[];


        #
        # Compute the size of the ".data.percpu" section.
        #
        # Allocate a memory area of "size of .data.percpu" x NR_CPUS from the bootmem allocator.
        #
        # Copy the contents of the ".data.percpu" section into this memory area, duplicating it NR_CPUS
        # times, and set __per_cpu_offset[] accordingly.
        #


        /* Copy section for each CPU (we discard the original) */
        size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
        #ifdef CONFIG_MODULES
            if (size < PERCPU_ENOUGH_ROOM)
                    size = PERCPU_ENOUGH_ROOM;
        #endif


        ptr = alloc_bootmem(size * NR_CPUS);


        for (i = 0; i < NR_CPUS; i++, ptr += size) {
            __per_cpu_offset[i] = ptr - __per_cpu_start;
            memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
        }
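
        #
        # A worked example with made-up numbers: suppose __per_cpu_start == 0xc0400000, the aligned
        # per-CPU size is 0x2000, and alloc_bootmem() returns 0xc1000000. Then:
        #
        #       __per_cpu_offset[0] = 0xc1000000 - 0xc0400000 = 0x00c00000
        #       __per_cpu_offset[1] = 0xc1002000 - 0xc0400000 = 0x00c02000
        #
        # For a variable whose link-time address &per_cpu__foo is 0xc0400010 (offset 0x10 into
        # ".data.percpu"), per_cpu(foo, 1) then accesses:
        #
        #       0xc0400010 + 0x00c02000 = 0xc1002010
        #
        # i.e. offset 0x10 into range #1 of the new memory area.
        #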




-----------------------------------------------------------------------------------


As we can see from the above, the memory area allocated by setup_per_cpu_areas() in fact has the following layout:


        ------------------------------      <-  __per_cpu_offset[0]


            range #0 for CPU #0


        ------------------------------
            SMP_CACHE_BYTES alignment
        ------------------------------      <-  __per_cpu_offset[1]


            range #1 for CPU #1


        ------------------------------
                    .
                    .
                    .
        ------------------------------
            SMP_CACHE_BYTES alignment
        ------------------------------      <-  __per_cpu_offset[N]


            range #N for CPU #N


        ------------------------------


In theory a per-CPU variable is an "array of elements", but in the implementation the elements are NOT laid out contiguously in RAM; each CPU's elements live in that CPU's "range", like the following:


        -------------------------------------------         <-  __per_cpu_offset[0]


            range #0 for CPU #0


                -----------------------------------
                CPU #0 copy of per-CPU variable #a
                -----------------------------------
                CPU #0 copy of per-CPU variable #b
                -----------------------------------
                CPU #0 copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------
            SMP_CACHE_BYTES alignment
        -------------------------------------------         <-  __per_cpu_offset[1]


            range #1 for CPU #1


                -----------------------------------
                CPU #1 copy of per-CPU variable #a
                -----------------------------------
                CPU #1 copy of per-CPU variable #b
                -----------------------------------
                CPU #1 copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------
                    .
                    .
                    .
        -------------------------------------------
            SMP_CACHE_BYTES alignment
        -------------------------------------------         <-  __per_cpu_offset[N]


            range #N for CPU #N


                -----------------------------------
                CPU #N copy of per-CPU variable #a
                -----------------------------------
                CPU #N copy of per-CPU variable #b
                -----------------------------------
                CPU #N copy of per-CPU variable #c
                -----------------------------------


        -------------------------------------------




In fact, the ".data.percpu" section of vmlinux is released after the memory area above has been constructed (!^^__perhaps when the bootmem allocator retires). The SMP per_cpu() then returns a pointer into the per-CPU range of that memory area, by computing:


    #
    #       &"per_cpu__##var" + __per_cpu_offset[cpu]
    #
    # [*] Note that "per_cpu__##var" is the original address value(!^^__known at compile/link time) of the
    # per-CPU variable in the ".data.percpu" section, which is discarded. We never access this address
    # directly; we just add __per_cpu_offset[cpu] to it, to get the actual address of the specified CPU's
    # copy inside the per-CPU range of the memory area.
    #
    # [*] So, this is why we don't define "an array of elements" in the SMP DEFINE_PER_CPU(), and how
    # per_cpu() works.
    #


    #define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))


        # define RELOC_HIDE(ptr, off)                   \
              ({ unsigned long __ptr;                   \
                 __ptr = (unsigned long) (ptr);             \
                (typeof(ptr)) (__ptr + (off)); })




-----------------------------------------------------------------------------------


[*] How are static per-CPU variables in modules handled ???


    Because we use the same APIs to access static per-CPU variables in the kernel and in modules, module per-CPU variables are also organized into the per-CPU memory ranges described by __per_cpu_offset[NR_CPUS].


    As we see from:


        setup_per_cpu_areas();
    
            /* Copy section for each CPU (we discard the original) */
            size = ALIGN(__per_cpu_end - __per_cpu_start, SMP_CACHE_BYTES);
            #ifdef CONFIG_MODULES
                if (size < PERCPU_ENOUGH_ROOM)
                        size = PERCPU_ENOUGH_ROOM;
            #endif      




        /* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
        #ifndef PERCPU_ENOUGH_ROOM
        #define PERCPU_ENOUGH_ROOM 32768
        #endif




    So, besides the per-CPU variables of vmlinux, the memory area also has room for the per-CPU variables of modules.


    The kernel duplicates NR_CPUS copies of a module's per-CPU variables into that memory area at module load time.


    As for details, see:


        /kernel/module.c    -   percpu_modinit() and so on      # well, not enough energy to investigate.




====================================================================================


5.2. impl of dynamic per-CPU variable


====================================================================================


5.2.1. UP version


#
# The API alloc_percpu() is common to both UP / SMP; it is __alloc_percpu() that makes the difference.
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
# The UP __alloc_percpu() just calls kmalloc(size, GFP_KERNEL), allocating (and zeroing) the single element.
#


static inline void *__alloc_percpu(size_t size, size_t align)
{
    void *ret = kmalloc(size, GFP_KERNEL);
    if (ret)
        memset(ret, 0, size);
    return ret;
}




#
# Correspondingly, the UP free_percpu() is also simple: it just frees the single element.
#


static inline void free_percpu(const void *ptr)
{   
    kfree(ptr);
}




#
# The UP per_cpu_ptr() simply returns the pointer to the single element.
#


#define per_cpu_ptr(ptr, cpu) (ptr)




====================================================================================


5.2.2. SMP version


#
# The API alloc_percpu() is common to both UP / SMP; it is __alloc_percpu() that makes the difference.
#


#define alloc_percpu(type) \
    ((type *)(__alloc_percpu(sizeof(type), __alignof__(type))))




#
# The SMP __alloc_percpu() also does not allocate an "array of NR_CPUS elements" for the dynamic per-CPU
# variable.
#
# Instead, it:
#       Allocates a "percpu_data" instance, which is:
#
#           struct percpu_data {        # represents a dynamic per-CPU variable, but is internal.
#               void *ptrs[NR_CPUS];
#           };
#
#       Then allocates each per-CPU element with the NUMA-aware kmem_cache_alloc_node(), saving each
#       element into "percpu_data->ptrs[]".
#
#       Returns an obfuscated (bitwise-inverted) value of the "percpu_data *".
#
# So, just like the static ones, the per-CPU elements of an SMP dynamic per-CPU variable are not
# contiguous in RAM.
#


void *__alloc_percpu(size_t size, size_t align)


    struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);


    for (i = 0; i < NR_CPUS; i++) {
        if (!cpu_possible(i))
            continue;
        pdata->ptrs[i] = kmem_cache_alloc_node(
                kmem_find_general_cachep(size, GFP_KERNEL),
                cpu_to_node(i));

        memset(pdata->ptrs[i], 0, size);
    }


    #
    # Note that we don't simply return the address of the "percpu_data", but an obfuscated
    # (bitwise-inverted) value.
    #
    return (void *) (~(unsigned long) pdata);
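
    #
    # [*] Bitwise NOT is its own inverse (~~x == x), so per_cpu_ptr() and free_percpu() can recover
    # the real "percpu_data *" from the value handed out here:
    #
    #       struct percpu_data *p = (struct percpu_data *) (~(unsigned long) ptr);
    #
    # The inversion makes the returned handle useless as a plain pointer, forcing callers to go
    # through per_cpu_ptr() instead of dereferencing it directly.
    #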






#
# Correspondingly, the SMP per_cpu_ptr() returns "percpu_data->ptrs[cpu]", after inverting the pointer
# bits back.
#


#define per_cpu_ptr(ptr, cpu)                   \
    ({                                              \
            struct percpu_data *__p = (struct percpu_data *)~(unsigned long)(ptr); \
            (__typeof__(ptr))__p->ptrs[(cpu)];  \
    })




#
# And the SMP free_percpu() frees the per-CPU elements in "percpu_data->ptrs[]", and then the
# "percpu_data" instance itself.
#


void free_percpu(const void *objp)


    #
    # Invert the bits back, to recover the actual address of the "percpu_data".
    #
    struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);


    for (i = 0; i < NR_CPUS; i++) {
        if (!cpu_possible(i))
            continue;
        kfree(p->ptrs[i]);
    }


    kfree(p);




====================================================================================


6. Misc tips


NONE.