[Translation] Per-CPU allocation

Source: Internet | Editor: 程序博客网 | Time: 2024/06/16 21:04

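Overview

- Per-CPU data gives every possible CPU its own separate copy of the data; each CPU accesses only its own per-CPU area.
- The total memory cost is (number of CPUs) x (size of the per-CPU data).
  . On NUMA systems the cost still scales with the number of CPUs, but the layout is more complicated.
- Per-CPU provides no API for synchronizing data between CPUs. Each CPU simply accesses the copy assigned to it, which is usually faster. Typical users are:
  . Memory managers that keep per-CPU caches:
    a. pcp (per-cpu pages) in the page allocator
    b. the SLUB allocator
  . Write-heavy performance counters, such as:
    a. vmstat
    b. network statistics
  . RCU infrastructure
  . Profiling, Ftrace
  . Some parts of the VFS

[Figure: an example of a per-CPU counter.]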
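The counter idea can be sketched in user-space C. This is only an illustration under stated assumptions: the names `demo_counter`, `demo_packets`, `demo_counter_inc()` and `demo_counter_sum()` are hypothetical, the CPU number is passed in by hand, and the 64-byte padding assumes a 64-byte cache line. The kernel's real per-CPU machinery works differently, as the later sections describe.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_DEMO 4          /* hypothetical CPU count for this sketch */

/* One counter slot per CPU; the padding keeps slots on separate cache lines
 * (assuming a 64-byte line). */
struct demo_counter {
    unsigned long count;
    char pad[64 - sizeof(unsigned long)];
};

static struct demo_counter demo_packets[NR_CPUS_DEMO];

/* Fast path: a CPU touches only its own slot, so no lock is taken. */
static void demo_counter_inc(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_DEMO);
    demo_packets[cpu].count++;
}

/* Slow path: a reader folds all per-CPU slots into one total. */
static unsigned long demo_counter_sum(void)
{
    unsigned long sum = 0;

    for (size_t cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
        sum += demo_packets[cpu].count;
    return sum;
}
```

Writers never share a slot, so the write path needs no lock; only the (rare) reader walks all slots.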

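Features

- Per-CPU support was merged into mainline during the 2.6 kernel series.
- On an SMP system, once data is placed in the per-CPU area every CPU gets its own instance of it, so synchronization is largely unnecessary.
  . Because each CPU accesses only its own per-CPU area, no lock is needed in most cases.
  . Per-CPU was designed for fast, lock-free writes.
- Protection is still required against interrupts and preemption.
- It is most often used for statistics on network packets, disks, and kernel objects. Because no lock is taken, updates stay fast even when the system performs thousands of writes per second. A total is obtained by summing the values of all CPUs.
- Each CPU's copy stays in that CPU's own cache, which avoids cache contention between CPUs.
- In the original design, per-CPU data was aligned to the cache line. Current kernels place per-CPU data according to how each variable is declared (for example, read-mostly data is declared separately), so that different kinds of per-CPU data are kept apart.
- Per-CPU data is annotated as living in the percpu address space rather than the kernel address space, so the sparse static analyzer can flag code that mixes the two.
- Dynamic allocation was introduced in 2009, and with it the chunk model.
  . Before that, the contents of the per-CPU section were allocated in each CPU's memory; afterwards they were managed as chunks.
- Unlike other chunks, the first chunk has three areas:
  . The per-CPU section of the kernel image, i.e. the kernel's static per-CPU data
  . The per-CPU sections of modules, i.e. the modules' static per-CPU data
  . Dynamically allocated per-CPU data
  All other chunks manage only a dynamic area.
- In September 2014, asynchronous chunk population was merged for 3.18-rc1 (https://lwn.net/Articles/609300/).
  . Chunks other than the first one get pages mapped only when an allocation actually uses those pages, which reduces memory and CPU usage.
  . References:
    a. percpu: implement asynchronous chunk population
    b. percpu: implement pcpu_nr_empty_pop_pages and chunk->nr_populated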

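Unit

- The part of the per-CPU data that belongs to one CPU is called a unit.
- A unit is managed as three areas:
  . Static area
    a. Defined with the DEFINE_PER_CPU() macro in the kernel image; its size is fixed at boot.
    b. Rpi2 example: 0x3ec0
  . Reserved area
    a. Defined with the DEFINE_PER_CPU() macro inside modules.
    b. Rpi2: default 0x2000 (8K)
  . Dynamic area
    a. Handed out at run time by alloc_percpu().
    b. Default 0x5000 (20K)
       - The default is 28K on 64-bit systems.
       - Before September 2014 the defaults were smaller: 12K (32-bit) and 20K (64-bit).
- Unit size = static size + reserved size + dynamic size, rounded up to a whole number of pages.
  . pcpu_unit_pages: number of pages in one unit. Rpi2 example: 11
  . pcpu_unit_size: number of bytes in one unit. Rpi2 example: 0xb000 (44KB = 11 * 4KB (PAGE_SIZE))
- On a system with 4K pages (such as ARM), the per-CPU area is sized to an exact multiple of the page size.
  . static_size + reserved_size + dyn_size is rounded up to 4K, and the space gained by rounding is added to the dynamic area.
  . Rpi2 example: 0x3ec0 + 0x2000 + 0x5000 + 0x140 (the 0x140 bytes left over after 4K alignment are added to the dynamic area)
- On systems such as x86-64, where a 2M page can be used, one allocation can hold several units.
  . Some space may be left over after the units are placed in the allocation; that space is not used for per-CPU data.

[Figure: a unit divided into three areas when the first chunk is created. A unit that has only a dynamic area is the same size as a unit that has all three areas.]

[Figure: the two allocation sizes, small and large (the large size is 2M, the size of one allocation; upa = units per allocation).]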
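The unit-size arithmetic above can be checked with a small user-space sketch. The names `demo_unit_layout`, `demo_page_align()` and `demo_compute_layout()` are hypothetical, and a 4K page size is assumed as in the Rpi2 example; this is not the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 0x1000UL   /* assumed 4K pages, as on 32-bit ARM */

struct demo_unit_layout {
    size_t static_size;
    size_t reserved_size;
    size_t dyn_size;     /* grows by whatever the page rounding leaves over */
    size_t unit_size;    /* page-aligned total */
    size_t unit_pages;
};

/* Round n up to the next multiple of the page size. */
static size_t demo_page_align(size_t n)
{
    return (n + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

static struct demo_unit_layout demo_compute_layout(size_t static_size,
                                                   size_t reserved_size,
                                                   size_t dyn_size)
{
    struct demo_unit_layout l;
    size_t sum = static_size + reserved_size + dyn_size;

    l.static_size = static_size;
    l.reserved_size = reserved_size;
    l.unit_size = demo_page_align(sum);
    /* The rounding remainder is handed to the dynamic area. */
    l.dyn_size = dyn_size + (l.unit_size - sum);
    l.unit_pages = l.unit_size / DEMO_PAGE_SIZE;
    assert(l.static_size + l.reserved_size + l.dyn_size == l.unit_size);
    return l;
}
```

With the Rpi2 numbers (0x3ec0, 0x2000, 0x5000) the sum is 0xaec0, which rounds up to 0xb000 (11 pages) and leaves 0x140 bytes for the dynamic area, giving dyn_size = 0x5140.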

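CPU -> unit mapping

- If per-CPU data were kept in arrays with NR_CPUS entries, memory would be wasted whenever NR_CPUS is larger than the number of CPUs actually present.
- To reduce that waste on kernels built with a very large NR_CPUS (to support thousands of CPUs) but running on only a few CPUs, several architecture patches optimized the space usage:
  . See: cpualloc v1: Optimize by removing arrays of pointers to per cpu objects | LWN.net
- On NUMA systems that use huge pages (2M), every allocation holds the same number of units even when the nodes have different numbers of CPUs.
- CPUs and units are normally mapped one to one, in order, but on NUMA systems the cpu->unit mapping can be non-linear (sparse).
  . A cpu->unit mapping array makes it easy to set up and look up the unit that belongs to a CPU.
  . On an asymmetric NUMA system the number of units differs from the number of CPUs, because some units are left unmapped.
- pcpu_unit_map[]
  . Maps a CPU id to a unit index.
  . The mapping is the same for every chunk.
  . Example: 0, 1, 2, 3
- pcpu_unit_offsets[]
  . Holds the offset of each unit from the start of the chunk.
  . The unit offsets are the same for every chunk.
  . Example: 0, 0x8000, 0x10000, 0x18000

[Figure: units mapped to the available CPUs on a NUMA system (16 units for 12 possible CPUs, with NR_CPUS = 32).]

[Figure: the offset of each unit's start address from the start address of the lowest unit.]

[Figure: the resulting group (node) information for 2 nodes with 2 CPUs each.]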
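A minimal sketch of how the two arrays could be used to find a CPU's copy inside a chunk, using the example values above. The names `demo_unit_map`, `demo_unit_offsets`, `demo_unit_index()` and `demo_cpu_addr()` are hypothetical, addresses are plain integers, and the offsets array is assumed to be indexed by CPU number (with the identity mapping used here, CPU number and unit index coincide).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4   /* hypothetical CPU count */

/* cpu -> unit index (identity mapping, as in the example above) */
static const unsigned int demo_unit_map[DEMO_NR_CPUS] = { 0, 1, 2, 3 };
/* cpu -> byte offset of that CPU's unit from the chunk base */
static const uintptr_t demo_unit_offsets[DEMO_NR_CPUS] = {
    0, 0x8000, 0x10000, 0x18000
};

static unsigned int demo_unit_index(unsigned int cpu)
{
    assert(cpu < DEMO_NR_CPUS);
    return demo_unit_map[cpu];
}

/* Address of byte 'off_in_unit' of the given CPU's unit in a chunk. */
static uintptr_t demo_cpu_addr(uintptr_t chunk_base, unsigned int cpu,
                               uintptr_t off_in_unit)
{
    assert(cpu < DEMO_NR_CPUS);
    return chunk_base + demo_unit_offsets[cpu] + off_in_unit;
}
```

Because the offsets are identical for every chunk, the same two small arrays serve all chunks; only `chunk_base` changes.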

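Chunk

- All units together form a chunk.
- A chunk is divided into nr_units units.
- First chunk:
  . The first chunk created during kernel initialization is called the first chunk.
  . Each unit of the first chunk is divided into static, reserved, and dynamic areas.
- Units of every other chunk have only a dynamic area.
- Every chunk is managed through a pcpu_chunk structure.
- On NUMA systems, a chunk is allocated from the memory of each group (node).

[Figure: a chunk allocated across the nodes; each node holds part of the chunk.]

[Figure: pcpu_base_addr points to the base address of the lowest (first) chunk of the per-CPU area.]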

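Per-CPU initialization

- Static per-CPU data is placed in the .data..percpu section at build time and is then copied into every unit, in one of two ways:
  . During kernel initialization the first chunk is set up. The kernel's static per-CPU data is read from the per-CPU section where it was stored at build time and copied into the static area of every unit of the first chunk.
  . A module's static per-CPU data is likewise built into a per-CPU section of the module. When the module is loaded at run time, that section is read and added to the reserved area of every unit of the first chunk.
- Depending on the kernel configuration, the first chunk is created with one of two per-CPU chunk allocation methods:

1) Embed method
  . If the architecture supports it, huge pages (for example 2M on x86) can be used for the allocation in ZONE_NORMAL memory, which improves TLB efficiency.
  . On UMA systems the first chunk is allocated in a single allocation.
  . On NUMA systems the first chunk is split up and allocated from each node's memory.
  . Additional chunks are placed in vmalloc space, allocated from the top down, using the same per-node layout that the first chunk was created with.
    - On 32-bit NUMA systems the vmalloc space is small (ARM = 240M, x86 = 120M).
      a) To leave vmalloc space for additional chunks, the distance between the base addresses of the per-node parts must not exceed 75% of the whole vmalloc address space.
         - See "percpu: make embedding first chunk allocator check vmalloc space size".
      b) If the first chunk cannot be created with the embed method, the page method is tried automatically.
  . Free vmalloc space is searched from the top down.
    a) vmalloc() and vmap() consume vmalloc space from the bottom up; per-CPU does the opposite, so that the two overlap as little as possible.
    b) On NUMA systems, per-CPU vmalloc allocations keep the same base offset between the per-node parts. Staying away from the general vmalloc users lets these allocations be found quickly and makes failures less likely.

2) Page method
  . When the first chunk is set up, physical pages are allocated in the minimum page unit and mapped into the vmalloc area.
    1) The SLUB allocator is not ready this early in boot, so the vmalloc area cannot be set up in the usual way. An early-registration path is used instead, and the area is registered properly once SLUB has been initialized.
    2) Once the SLUB allocator is active, percpu_init_late() is called to go through the maps of all chunks allocated so far and reallocate them.
  . The chunk is not split across per-node vmalloc regions; all units are laid out together.
  . Because huge pages (2M) are not used, performance may be worse than with the embed method, but the method works without problems on 32-bit NUMA systems where vmalloc space is small.
  . Free vmalloc space is searched from the bottom up.
    1) The area can be taken the same way vmalloc()/vmap() would take it, because the chunk does not have to be arranged per node.

- Demand paging
  . The first chunk has all of its pages allocated and mapped during kernel initialization, but additional chunks use demand paging.
  . Additional chunks are created after the SLUB allocator is initialized. Creating one sets up a pcpu_chunk structure and reserves vmalloc space for the chunk, but allocates no pages and therefore establishes no mappings.
  . A bitmap records, page by page, whether each page of the chunk has been populated (allocated and mapped).
    1) When a per-CPU allocation uses a page whose bit is not set in the populated bitmap, a physical page is allocated and mapped with vmap().
    2) The number of populated pages is tracked in the nr_populated member.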
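The populated-bitmap bookkeeping described above can be sketched in user space as follows. The names `demo_chunk`, `demo_page_populated()` and `demo_populate()` are hypothetical, a single `unsigned long` stands in for the bitmap, and no real page is allocated or mapped; the sketch only shows how the bits and the nr_populated count move together.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_UNIT_PAGES 11   /* pages per unit, as in the Rpi2 example */

struct demo_chunk {
    unsigned long populated;  /* bit n set => page n is backed and mapped */
    int nr_populated;         /* number of bits set in 'populated' */
};

static bool demo_page_populated(const struct demo_chunk *c, int page)
{
    return (c->populated >> page) & 1UL;
}

/*
 * Make sure pages [start, end) are populated before an allocation uses
 * them. Returns the number of pages that had to be populated now.
 */
static int demo_populate(struct demo_chunk *c, int start, int end)
{
    int newly = 0;

    assert(0 <= start && start <= end && end <= DEMO_UNIT_PAGES);
    for (int page = start; page < end; page++) {
        if (demo_page_populated(c, page))
            continue;               /* already backed, nothing to do */
        /* The kernel would allocate a physical page and map it here. */
        c->populated |= 1UL << page;
        c->nr_populated++;
        newly++;
    }
    return newly;
}
```

A freshly added chunk starts with an all-zero bitmap, so the first allocation that touches a page pays for populating it and later allocations on the same page do not.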

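[Figure: static per-CPU data is declared, copied into the static area of every unit at kernel initialization, and copied into the reserved area of every unit of the first chunk when a module is loaded.]

[Figure: placement of the first chunk on a UMA system with the embed method.]

[Figure: placement of additional chunks on a UMA system with the embed method.]

[Figure: placement of the first chunk on a NUMA system with the embed method.]

[Figure: placement of additional chunks on a NUMA system with the embed method.]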

=================================================

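[Figure: placement of the first chunk when it is created with the page method.]

[Figure: placement of additional chunks created with the page method.]

[Figure: accessing static per-CPU data through this_cpu_ptr().]

[Figure: accessing dynamic per-CPU data through this_cpu_ptr().]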

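Managing the map entries of the first chunk

- The first chunk is managed in a special way, because it determines the placement of every group and unit.
- The first chunk may be managed through one or two chunk maps:
  . pcpu_first_chunk manages the dynamically allocated space and, like any other chunk, is added to pcpu_slot[].
  . pcpu_reserved_chunk is not added to pcpu_slot[]; it is used by the module loader.
- When the kernel uses modules, the pcpu_reserved_chunk global manages the map entries of the reserved area of the first chunk.
  . When a kernel module is loaded, the per-CPU data it defines with DEFINE_PER_CPU() is added to the reserved area.
  . If kernel modules carry a lot of static per-CPU data, the reserved area must be enlarged and the kernel rebuilt.

[Figure: how the two management structures of the first chunk work with and without kernel modules.]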

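Per-CPU area storage section

- __per_cpu_load is the virtual address at which the per-CPU section is loaded into memory.
- For performance, several macros are provided that store per-CPU data in different sub-sections.
- Cache line = 64B (Rpi2: L1 d-cache line)
- ARM 32-bit: PAGE_SIZE = 4KB

As the figure showed, the per-CPU section is placed below the .data section. The same can be seen in the vmlinux.lds.S linker script: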

arch/arm/kernel/vmlinux.lds.S

    #ifdef CONFIG_SMP
            PERCPU_SECTION(L1_CACHE_BYTES)
    #endif

include/asm-generic/vmlinux.lds.h

    /**
     * PERCPU_SECTION - define output section for percpu area, simple version
     * @cacheline: cacheline size
     *
     * Align to PAGE_SIZE and outputs output section for percpu area. This
     * macro doesn't manipulate @vaddr or @phdr and __per_cpu_load and
     * __per_cpu_start will be identical.
     *
     * This macro is equivalent to ALIGN(PAGE_SIZE); PERCPU_VADDR(@cacheline,,)
     * except that __per_cpu_load is defined as a relative symbol against
     * .data..percpu which is required for relocatable x86_32 configuration.
     */
    #define PERCPU_SECTION(cacheline) \
            . = ALIGN(PAGE_SIZE); \
            .data..percpu : AT(ADDR(.data..percpu) - LOAD_OFFSET) { \
                    VMLINUX_SYMBOL(__per_cpu_load) = .; \
                    PERCPU_INPUT(cacheline) \
            }

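This macro defines the output section of the per-CPU area.

- LOAD_OFFSET: defined as 0 in asm-generic, and the ARM architecture does not override it.
- __per_cpu_load: symbol marking the start address of the .data..percpu section (used by the module code).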

    /**
     * PERCPU_INPUT - the percpu input sections
     * @cacheline: cacheline size
     *
     * The core percpu section names and core symbols which do not rely
     * directly upon load addresses.
     *
     * @cacheline is used to align subsections to avoid false cacheline
     * sharing between subsections for different purposes.
     */
    #define PERCPU_INPUT(cacheline) \
            VMLINUX_SYMBOL(__per_cpu_start) = .; \
            *(.data..percpu..first) \
            . = ALIGN(PAGE_SIZE); \
            *(.data..percpu..page_aligned) \
            . = ALIGN(cacheline); \
            *(.data..percpu..read_mostly) \
            . = ALIGN(cacheline); \
            *(.data..percpu) \
            *(.data..percpu..shared_aligned) \
            VMLINUX_SYMBOL(__per_cpu_end) = .;

- This macro defines the input sections of the per-CPU area.

Dynamic allocation API

- alloc_percpu()
  . Allocates per-CPU data of the size of the given type (for example int).

    #define alloc_percpu(type) \
            (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
                                                    __alignof__(type))

- __alloc_percpu()
  . Allocates per-CPU data of the given size, aligned to the given alignment.

    void __percpu *__alloc_percpu(size_t size, size_t align)
    {
            return pcpu_alloc(size, align, false, GFP_KERNEL);
    }

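API for the reserved area

- DEFINE_PER_CPU()
  . Defines a per-CPU variable. Code in other files refers to the variable through the declaration macro below.
- DECLARE_PER_CPU()
  . Because per-CPU variables are placed in a separate location, this macro is used to make the extern declaration.
    1) For example, when a module wants to use a static per-CPU variable of the kernel, such as the vmstat counters.

    #define DECLARE_PER_CPU(type, name) \
            DECLARE_PER_CPU_SECTION(type, name, "")

    #define DECLARE_PER_CPU_SECTION(type, name, sec) \
            extern __PCPU_ATTRS(sec) __typeof__(type) name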

Per-CPU data access API

- Because the kernel is preemptible, per-CPU data must be accessed through the proper API.
- Looking up a per-CPU object involves the following memory accesses:
  . smp_processor_id()
  . the base address of the array of per-CPU objects
  . the array of pointers to the per-CPU objects
  . the per-CPU object itself

1) lvalue operations

- Use get_cpu_var() to update per-CPU data, as in the example below, and call put_cpu_var() when done.

    get_cpu_var(socket_in_use)++;
    put_cpu_var(socket_in_use);

- get_cpu_var()
  . On SMP systems, disables preemption with preempt_disable() and then yields the current CPU's instance of the variable.
  . The argument must be an lvalue.

    /*
     * Must be an lvalue. Since @var must be a simple identifier,
     * we force a syntax error here if it isn't.
     */
    #define get_cpu_var(var) \
    (*({ \
            preempt_disable(); \
            this_cpu_ptr(&var); \
    }))

- put_cpu_var()
  . On SMP systems, called after the value has been written; it re-enables preemption with preempt_enable().

    /*
     * The weird & is necessary because sparse considers (void)(var) to be
     * a direct dereference of percpu variable (var).
     */
    #define put_cpu_var(var) \
    do { \
            (void)&(var); \
            preempt_enable(); \
    } while (0)

2) Pointer operations

To access the data through a pointer, as in the example below, first obtain the current CPU id, use per_cpu_ptr() to get the pointer, and then work with that pointer. put_cpu() must be called once the per-CPU data is no longer in use.

    int cpu;

    cpu = get_cpu();
    ptr = per_cpu_ptr(per_cpu_var, cpu);
    /* work with the ptr */
    put_cpu();

- get_cpu()
  . Disables preemption and returns the CPU id.
  . After using this function, put_cpu() must be called to re-enable preemption.

    #define get_cpu() ({ preempt_disable(); smp_processor_id(); })

- smp_processor_id()
  . With CONFIG_DEBUG_PREEMPT enabled, it prints a warning when called while preemption is enabled.
  . Returns the id of the CPU on which the current task is running.

    /*
     * smp_processor_id(): get the current CPU ID.
     *
     * if DEBUG_PREEMPT is enabled then we check whether it is
     * used in a preemption-safe way. (smp_processor_id() is safe
     * if it's used in a preemption-off critical section, or in
     * a thread that is bound to the current CPU.)
     *
     * NOTE: raw_smp_processor_id() is for internal use only
     * (smp_processor_id() is the preferred variant), but in rare
     * instances it might also be used to turn off false positives
     * (i.e. smp_processor_id() use that the debugging code reports but
     * which use for some reason is legal). Don't use this to hack around
     * the warning message, as your code might not work under PREEMPT.
     */
    #ifdef CONFIG_DEBUG_PREEMPT
      extern unsigned int debug_smp_processor_id(void);
    # define smp_processor_id() debug_smp_processor_id()
    #else
    # define smp_processor_id() raw_smp_processor_id()
    #endif

    #define raw_smp_processor_id() (current_thread_info()->cpu)

- this_cpu_ptr()
  . With CONFIG_DEBUG_PREEMPT enabled, __verify_pcpu_ptr() first checks the pointer for the benefit of the sparse static checker, and the macro then returns the address ptr + my_cpu_offset.
  . Without CONFIG_DEBUG_PREEMPT, it simply calls raw_cpu_ptr(), which returns ptr plus the current CPU's offset.

    #ifdef CONFIG_DEBUG_PREEMPT
    #define this_cpu_ptr(ptr) \
    ({ \
            __verify_pcpu_ptr(ptr); \
            SHIFT_PERCPU_PTR(ptr, my_cpu_offset); \
    })
    #else
    #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
    #endif

- __verify_pcpu_ptr()
  . This macro exists mainly for the sparse static checker; it verifies that ptr points into the per-CPU address space.

    /*
     * __verify_pcpu_ptr() verifies @ptr is a percpu pointer without evaluating
     * @ptr and is invoked once before a percpu area is accessed by all
     * accessors and operations. This is performed in the generic part of
     * percpu and arch overrides don't need to worry about it; however, if an
     * arch wants to implement an arch-specific percpu accessor or operation,
     * it may use __verify_pcpu_ptr() to verify the parameters.
     *
     * + 0 is required in order to convert the pointer type from a
     * potential array type to a pointer to a single item of the array.
     */
    #define __verify_pcpu_ptr(ptr) \
    do { \
            const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
            (void)__vpp_verify; \
    } while (0)

- SHIFT_PERCPU_PTR()
  . Simply adds __offset to __p, casts the result into the kernel address space, and returns it.

    /*
     * Add an offset to a pointer but keep the pointer as-is. Use RELOC_HIDE()
     * to prevent the compiler from making incorrect assumptions about the
     * pointer value. The weird cast keeps both GCC and sparse happy.
     */
    #define SHIFT_PERCPU_PTR(__p, __offset) \
            RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))

- RELOC_HIDE()
  . Returns the address ptr + off.
  . As the comment below explains, the macro hides the arithmetic from the compiler. Whatever the type of the variable, it takes the address, adds the offset, and casts the result back to the original type.
  . ppc64 compilers before gcc 4.1 had a bug in which RELOC_HIDE could trash r30; the comment describes the workaround.

    /*
     * This macro obfuscates arithmetic on a variable address so that gcc
     * shouldn't recognize the original var, and make assumptions about it.
     *
     * This is needed because the C standard makes it undefined to do
     * pointer arithmetic on "objects" outside their boundaries and the
     * gcc optimizers assume this is the case. In particular they
     * assume such arithmetic does not wrap.
     *
     * A miscompilation has been observed because of this on PPC.
     * To work around it we hide the relationship of the pointer and the object
     * using this macro.
     *
     * Versions of the ppc64 compiler before 4.1 had a bug where use of
     * RELOC_HIDE could trash r30. The bug can be worked around by changing
     * the inline assembly constraint from =g to =r, in this particular
     * case either is valid.
     */
    #define RELOC_HIDE(ptr, off) \
    ({ unsigned long __ptr; \
       __asm__ ("" : "=r"(__ptr) : "0"(ptr)); \
       (typeof(ptr)) (__ptr + (off)); })

- raw_cpu_ptr()
  . For the sparse static checker, __verify_pcpu_ptr() first checks the pointer; the macro then calls arch_raw_cpu_ptr(), which returns the address of ptr plus the current CPU's offset.

    #define raw_cpu_ptr(ptr) \
    ({ \
            __verify_pcpu_ptr(ptr); \
            arch_raw_cpu_ptr(ptr); \
    })

- arch_raw_cpu_ptr()
  . Returns the address of ptr + __my_cpu_offset.

    /*
     * Arch may define arch_raw_cpu_ptr() to provide more efficient address
     * translations for raw_cpu_ptr().
     */
    #ifndef arch_raw_cpu_ptr
    #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
    #endif

- __my_cpu_offset()
  . Returns the offset that belongs to the current CPU; the value is stored in the TPIDRPRW register.
    1) The value is added to a per-CPU pointer to reach the current CPU's copy of the data.
    2) The ARM architecture stores the per-CPU offset in the otherwise unused CP15 register TPIDRPRW, which speeds up per-CPU operations.
  . arch/arm/include/asm/percpu.h

    static inline unsigned long __my_cpu_offset(void)
    {
            unsigned long off;

            /*
             * Read TPIDRPRW.
             * We want to allow caching the value, so avoid using volatile and
             * instead use a fake stack read to hazard against barrier().
             */
            asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off)
                    : "Q" (*(const unsigned long *)current_stack_pointer));

            return off;
    }
    #define __my_cpu_offset __my_cpu_offset()

- put_cpu()
  . Re-enables preemption.

    #define put_cpu() preempt_enable()

- per_cpu()
  . Accesses the per-CPU variable of a specific CPU by dereferencing the pointer obtained from per_cpu_ptr().

    #define per_cpu(var, cpu) (*per_cpu_ptr(&(var), cpu))

- per_cpu_ptr()
  . For the sparse static checker, __verify_pcpu_ptr() first checks the pointer; the macro then returns the address of ptr plus the offset of the given CPU.

    #define per_cpu_ptr(ptr, cpu) \
    ({ \
            __verify_pcpu_ptr(ptr); \
            SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
    })

- per_cpu_offset()
  . __per_cpu_offset[]: to access per-CPU data, the offset of the given CPU number must be added; those offsets are stored in this array.

    /*
     * per_cpu_offset() is the offset that has to be added to a
     * percpu variable to get to the instance for a certain processor.
     *
     * Most arches use the __per_cpu_offset array for those offsets but
     * some arches have their own ways of determining the offset (x86_64, s390).
     */
    #ifndef __per_cpu_offset
    extern unsigned long __per_cpu_offset[NR_CPUS];

    #define per_cpu_offset(x) (__per_cpu_offset[x])
    #endif
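The offset mechanism behind per_cpu_ptr() can be imitated in user space. The sketch below is an illustration under assumptions, not kernel code: `demo_template` stands in for the .data..percpu image, `demo_chunk` for the first chunk, `demo_cpu_offset[]` for `__per_cpu_offset[]`, and all `demo_*` names and sizes are hypothetical. The arithmetic is done on `uintptr_t` values rather than on pointers, since pointer arithmetic across objects is exactly what RELOC_HIDE() has to hide from the compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CPUS      4      /* hypothetical CPU count */
#define DEMO_UNIT_SIZE 64     /* hypothetical, tiny unit */

/* Stand-in for the .data..percpu image: initial values of static variables. */
static unsigned char demo_template[DEMO_UNIT_SIZE];
/* Stand-in for the first chunk: one unit per CPU, back to back. */
static unsigned char demo_chunk[DEMO_CPUS * DEMO_UNIT_SIZE];
/* Like __per_cpu_offset[]: what to add to a template address for each CPU. */
static uintptr_t demo_cpu_offset[DEMO_CPUS];

/* Copy the template into every unit and record each CPU's offset. */
static void demo_setup_areas(void)
{
    /* Unsigned arithmetic wraps, so this works whichever array is lower. */
    uintptr_t delta = (uintptr_t)demo_chunk - (uintptr_t)demo_template;

    for (int cpu = 0; cpu < DEMO_CPUS; cpu++) {
        memcpy(demo_chunk + cpu * DEMO_UNIT_SIZE, demo_template,
               DEMO_UNIT_SIZE);
        demo_cpu_offset[cpu] = delta + (uintptr_t)cpu * DEMO_UNIT_SIZE;
    }
}

/* Like per_cpu_ptr(): shift a template address by the CPU's offset. */
static void *demo_per_cpu_ptr(void *template_addr, int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_CPUS);
    return (void *)((uintptr_t)template_addr + demo_cpu_offset[cpu]);
}
```

A caller names a variable by its address in the template and gets back the address of one CPU's private copy; writing through that pointer leaves the other CPUs' copies untouched.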

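Other key functions

- The __percpu macro
  . The attributes are hints for the sparse static checker.
  . address_space(3): the address space assigned to per-CPU data.
  . noderef: the pointer must not be dereferenced directly, i.e. a plain "*ptr" is not allowed.

    # define __percpu __attribute__((noderef, address_space(3)))

- pcpu_addr_in_first_chunk()
  . Returns whether the given address lies within the first chunk.
- pcpu_addr_in_reserved_chunk()
  . Returns whether the given address lies within the static and reserved areas.
  . If there is no reserved chunk (that is, no reserved area), pcpu_reserved_chunk_limit is 0.
- A further helper returns the start address of a page, given a chunk, a CPU number, and a page index.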

An example that uses per-CPU data to count SLUB events:

- include/linux/slub_def.h

    /*
     * Slab cache management.
     */
    struct kmem_cache {
            struct kmem_cache_cpu __percpu *cpu_slab;
            /* remaining members omitted */
    };

    struct kmem_cache_cpu {
            void **freelist;        /* Pointer to next available object */
            unsigned long tid;      /* Globally unique transaction id */
            struct page *page;      /* The slab from which we are allocating */
            struct page *partial;   /* Partially allocated frozen slabs */
    #ifdef CONFIG_SLUB_STATS
            unsigned stat[NR_SLUB_STAT_ITEMS];
    #endif
    };

    static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
    {
            stat(s, CPUSLAB_FLUSH);
            deactivate_slab(s, c->page, c->freelist);

            c->tid = next_tid(c->tid);
            c->page = NULL;
            c->freelist = NULL;
    }

    static inline void stat(const struct kmem_cache *s, enum stat_item si)
    {
    #ifdef CONFIG_SLUB_STATS
            /*
             * The rmw is racy on a preemptible kernel but this is acceptable, so
             * avoid this_cpu_add()'s irq-disable overhead.
             */
            raw_cpu_inc(s->cpu_slab->stat[si]);
    #endif
    }

- include/linux/slub_def.h

    enum stat_item {
            ALLOC_FASTPATH,         /* Allocation from cpu slab */
            ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
            FREE_FASTPATH,          /* Free to cpu slab */
            FREE_SLOWPATH,          /* Freeing not to cpu slab */
            FREE_FROZEN,            /* Freeing to frozen slab */
            FREE_ADD_PARTIAL,       /* Freeing moves slab to partial list */
            FREE_REMOVE_PARTIAL,    /* Freeing removes last object */
            ALLOC_FROM_PARTIAL,     /* Cpu slab acquired from node partial list */
            ALLOC_SLAB,             /* Cpu slab acquired from page allocator */
            ALLOC_REFILL,           /* Refill cpu slab from slab freelist */
            ALLOC_NODE_MISMATCH,    /* Switching cpu slab */
            FREE_SLAB,              /* Slab freed to the page allocator */
            CPUSLAB_FLUSH,          /* Abandoning of the cpu slab */
            (....)

- include/asm-generic/percpu.h

    #ifndef raw_cpu_add_4
    #define raw_cpu_add_4(pcp, val) raw_cpu_generic_to_op(pcp, val, +=)
    #endif

    #define raw_cpu_generic_to_op(pcp, val, op) \
    do { \
            *raw_cpu_ptr(&(pcp)) op val; \
    } while (0)

Per-CPU data structures

- The pcpu_chunk structure

    struct pcpu_chunk {
            struct list_head list;          /* linked to pcpu_slot lists */
            int free_size;                  /* free bytes in the chunk */
            int contig_hint;                /* max contiguous size hint */
            void *base_addr;                /* base address of this chunk */

            int map_used;                   /* # of map entries used before the sentry */
            int map_alloc;                  /* # of map entries allocated */
            int *map;                       /* allocation map */
            struct work_struct map_extend_work; /* async ->map[] extension */

            void *data;                     /* chunk data */
            int first_free;                 /* no free below this */
            bool immutable;                 /* no [de]population allowed */
            int nr_populated;               /* # of populated pages */
            unsigned long populated[];      /* populated bitmap */
    };

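. list
  (1) List node that links the chunk into the pcpu_slot[] lists used to manage chunks.
  (2) Every chunk that serves dynamic allocations is linked into pcpu_slot[]. Allocations are made from these chunks, and the slots are ordered by the amount of free space.
. free_size
  (1) Amount of free space in the chunk.
  (2) free_size changes on every allocation and free, and the chunk is then moved to the slot that matches the new value.
. contig_hint
  (1) Largest contiguous free region in the chunk.
  (2) When the first chunk is created, contig_hint equals free_size.
  (3) An allocation larger than this value cannot be satisfied from this chunk.
. base_addr
  (1) Lowest starting virtual address of the chunk's allocation.
  (2) On NUMA systems the units of a chunk are spread over the nodes, so as many allocation areas as there are nodes are used.
      . This is usually the start address of unit 0, but depending on the cpu->unit mapping it may not be.
      . Each node has its own lowest start address.
  (3) On UMA systems a chunk uses a single allocation area, so there is only one start address.
. map_used
  (1) Number of map[] entries in use in this chunk.
  (2) map_used + 1 entries are actually occupied, because one extra entry records the end offset. For example, with map_used = 4, 4 + 1 entries are used.
. map_alloc
  (1) Maximum number of entries the map[] array of this chunk can currently hold (initially 128; the array is extended when that is not enough).
. map[]
  (1) map[] is a variable-length array that grows dynamically.
      . When map[] is extended, map_alloc grows accordingly.
      . While only the early boot allocator is available, it has PERCPU_DYNAMIC_EARLY_SLOTS entries.
      . After the SLUB allocator is initialized, the array grows whenever more entries are needed.
  (2) Meaning of the map[] values
      . Old format: each value is a length. A positive value is the size of a free region in the unit; a negative value is the size of a region in use.
      . The new format was introduced in v3.15 in 2014.
        - See "percpu: store offsets instead of lengths in ->map[]".
        - Each value is the byte offset at which a region starts. Offsets are aligned to sizeof(int), so the lowest bit is free to carry the allocation state: 1 = in use, 0 = free. A region ends where the next entry's offset begins.
        - Example: 4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free; the total unit size is 132 and map_used = 6:
          . old map[] = {4, -8, -4, 4, -12, 100, 0, }
          . new map[] = {0, 5, 13, 16, 21, 32, 133, 0, }
            As <offset, in-use flag> pairs: <0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>. In the stored value, bit 0 of the offset holds the in-use flag.
            0-3: 4 bytes free, 4-11: 8 bytes in use, 12-15: 4 in use, 16-19: 4 free, 20-31: 12 in use, 32-131: 100 free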

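. data
  (1) Stores the address of the array of page structure pointers for the pages allocated to the chunk.
. immutable
  (1) Indicates whether the chunk may be changed (populated or depopulated).
. nr_populated
  (1) Number of physical pages actually allocated and mapped in the chunk, i.e. the number of pages really in use.
. populated[]
  (1) A bitmap that records which pages of the chunk are backed by allocated pages; it tracks pages, not the regions allocated inside the chunk.
  (2) For the first chunk every bit is set, because all of its pages are already allocated and mapped.
  (3) For an additional chunk the bits are 0 until the space is actually used (until then the chunk only occupies vmalloc address space, with no physical pages behind it).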
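The relation between the old length-based map[] and the new offset-based map[] described under the map[] field can be reproduced with a short user-space sketch. The function names `demo_lengths_to_offsets()`, `demo_area_size()` and `demo_area_in_use()` are hypothetical; the sketch assumes, as the text says, that every length is even so that bit 0 of an offset is free to hold the in-use flag.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Convert old-style lengths (+free, -used) into new-style offsets with the
 * in-use flag in bit 0. Writes n + 1 entries (the last one is the sentinel
 * holding the end offset) and returns the total size covered.
 */
static int demo_lengths_to_offsets(const int *old, int n, int *map)
{
    int off = 0;

    for (int i = 0; i < n; i++) {
        int used = old[i] < 0;

        assert(abs(old[i]) % 2 == 0);   /* bit 0 must stay free for the flag */
        map[i] = off | used;
        off += abs(old[i]);
    }
    map[n] = off | 1;   /* sentinel: end offset, marked in use */
    return off;
}

/* Size of region i: distance to the next entry's offset, flags masked off. */
static int demo_area_size(const int *map, int i)
{
    return (map[i + 1] & ~1) - (map[i] & ~1);
}

static int demo_area_in_use(const int *map, int i)
{
    return map[i] & 1;
}
```

Feeding in the old array {4, -8, -4, 4, -12, 100} yields {0, 5, 13, 16, 21, 32, 133}, the new array from the example. One consequence of storing offsets is that the size of a region is no longer stored at all; it falls out of two neighbouring entries.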

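- The pcpu_group_info structure

    struct pcpu_group_info {
            int nr_units;                   /* aligned # of units */
            unsigned long base_offset;      /* base address offset */
            unsigned int *cpu_map;          /* unit->cpu map, empty
                                             * entries contain NR_CPUS */
    };

  . On NUMA systems each node is organized as one group, and each group manages the CPUs of its node. A UMA system has a single node and therefore uses a single group.
  . nr_units: number of units used by this group.
  . base_offset: base offset address used by this group.
  . cpu_map[]: unit->cpu mapping used by this group.
  . The same three items also exist as global variables, but the globals cover all units and CPUs.
  . The information in this structure is no longer available once setup_per_cpu_areas() has finished the per-CPU initialization.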

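- The pcpu_alloc_info structure

    struct pcpu_alloc_info {
            size_t static_size;
            size_t reserved_size;
            size_t dyn_size;
            size_t unit_size;
            size_t atom_size;
            size_t alloc_size;
            size_t __ai_size;               /* internal, don't use */
            int nr_groups;                  /* 0 if grouping unnecessary */
            struct pcpu_group_info groups[];
    };

  . This structure is used as a local variable while setup_per_cpu_areas() initializes per-CPU.
  . After initialization the global variables must be used instead, because the contents of this structure (for example the unit->cpu mappings) are gone by then.
  . static_size: size of the static area in one unit, e.g. 0x3ec0.
  . reserved_size: on kernels built with CONFIG_MODULE, the size of the reserved area available in one unit, e.g. 8K.
  . dyn_size: size of the dynamic area in one unit; it differs between architectures. On ARM the value is somewhat larger than it used to be; the basic requirement is 20K (32-bit) or 28K (64-bit). If anything is left over after static_size + reserved_size + dyn_size is aligned to 4K, the remainder is added to dyn_size as well. Rpi2: 0x5140 (the initial value is 20K, 0x5000)
  . atom_size: minimum allocation size; 4K on ARM.
  . alloc_size: size to allocate. Rpi2: 0xb000 x 4 = 0x2_c000
  . nr_groups: number of groups (nodes). Rpi2: 1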

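Key global variables

. pcpu_unit_pages: number of pages in one unit. Rpi2 example: 0xb (one unit = 11 pages)
. pcpu_unit_size: size of one unit in bytes. Rpi2 example: 0xb000 (one unit = 44K)
. pcpu_nr_units: total number of units. Rpi2: 4
. pcpu_atom_size: page size used for allocation. It is 4K on ARM; some platforms, such as x86-64 NUMA systems, use 2M. Rpi2: 0x1000 (4K)
. pcpu_nr_slots: number of slots managed.
  - It is the slot number that corresponds to the unit size, plus 2.
  - Rpi2: with a 44K unit the slot number is 13, so pcpu_nr_slots = 13 + 2 = 15.
. pcpu_chunk_struct_size: set when the first chunk is created, to the size of the chunk structure plus the size of the populated bitmap (which depends on the unit size).
. pcpu_low_unit_cpu: number of the CPU whose unit lies at the lowest address. Rpi2: 0
. pcpu_high_unit_cpu: number of the CPU whose unit lies at the highest address. Rpi2: 3
. pcpu_base_addr: start virtual address of the first chunk. On NUMA systems there are as many allocations as groups, and this is the lowest of their start addresses.
. pcpu_unit_map[]: cpu -> unit mapping. Rpi2: {0, 1, 2, 3}
. pcpu_unit_offsets[]: array holding the offset of each unit. Rpi2: {0, 0xb000, 0x1_6000, 0x2_1000}
. pcpu_nr_groups: number of groups, i.e. the number of NUMA nodes. Rpi2: 1
. pcpu_group_offsets[]: array holding the offset of each group. Rpi2: 0
. pcpu_group_sizes[]: array holding the size of each group. Rpi2: {0x2_c000}
. pcpu_first_chunk: pointer to the pcpu_chunk that manages the map of the dynamic area of the first chunk.
. pcpu_reserved_chunk: pointer to the pcpu_chunk that manages the reserved area of the first chunk; NULL when modules are not used.
. pcpu_reserved_chunk_limit: if a reserved area exists, static_size + reserved_size.
. pcpu_slot[]:
  - pcpu_nr_slots is the size of this array; each element is a list_head that links the chunks belonging to that slot.
  - The last slot is the slot for completely empty chunks.
  - The slot array manages the chunks that serve the dynamic area.
    . Rpi2: take static_size = 0x3ec0, reserved_size = 0x2000, dyn_size = 0x5140 as an example.
    . The first chunk has free_size = 0x5140, which puts it in slot 12.
    . The reserved chunk has free_size = 0x5ec0, but it is only allocated from when modules are loaded, so it is not managed in pcpu_slot[] (which covers the dynamic area only).
    . So right after per-CPU initialization, the first chunk is linked on pcpu_slot[12] and the slots behind it are still empty.
. pcpu_nr_empty_pop_pages: number of pages that are populated but currently empty (dynamic area only). The reserved chunk, which manages the reserved area of the first chunk, does not contribute to this count.
. pcpu_async_enabled: when set, the asynchronous work function pcpu_balance_workfn() may be scheduled to populate and free per-CPU areas in the background. It is enabled from an initcall during kernel initialization.
. pcpu_atomic_alloc_failed: set to true when an atomic allocation has failed.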
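The slot numbers quoted for the Rpi2 (13 for a 44K unit, 15 slots in total, slot 12 for a free size of 0x5140) can be reproduced with the sketch below. It assumes the slot is derived from the position of the highest set bit of the size, with a base shift of 5 (the value of PCPU_SLOT_BASE_SHIFT in mm/percpu.c of that era); the `demo_*` names are hypothetical.

```c
#include <assert.h>

#define DEMO_SLOT_BASE_SHIFT 5   /* assumed equal to PCPU_SLOT_BASE_SHIFT */

/* Position of the highest set bit, 1-based; 0 for x == 0 (like fls()). */
static int demo_fls(unsigned int x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Slot for a given amount of free space, ignoring the "all free" case. */
static int demo_size_to_slot_raw(unsigned int size)
{
    int slot = demo_fls(size) - DEMO_SLOT_BASE_SHIFT + 2;

    return slot > 1 ? slot : 1;
}

/* Total number of slots: the slot of a whole unit, plus 2. */
static int demo_nr_slots(unsigned int unit_size)
{
    return demo_size_to_slot_raw(unit_size) + 2;
}

/* A completely free chunk (free size == unit size) goes to the last slot. */
static int demo_size_to_slot(unsigned int size, unsigned int unit_size)
{
    if (size == unit_size)
        return demo_nr_slots(unit_size) - 1;
    return demo_size_to_slot_raw(size);
}
```

Each slot covers a power-of-two range of free sizes, so a chunk moves to a different list only when its free space crosses such a boundary.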


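Original article: http://jake.dothome.co.kr/per-cpu/

- This translation is rough in places and will be improved later.
- Original translation; please credit the source when reposting.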
