something behind kmalloc


starting from the most commonly encountered problem

a compile error

while programming kernel code, the most used functions are probably kmalloc/kfree, as almost all kernel developers are told not to consume too much stack in the kernel because of its limited size (no more than 2 pages at most).

and if not lucky, we'll get errors like these while compiling:

/home/ext_first/kernel/net/hello.c: In function 'new_mem':
/home/ext_first/kernel/net/hello.c:23:5: error: implicit declaration of function 'kmalloc' [-Werror=implicit-function-declaration]
     mem_hello[0] = (unsigned int *)kmalloc(sizeof(unsigned int)*16, GFP_KERNEL);
     ^
/home/ext_first/kernel/net/hello.c:23:20: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
     mem_hello[0] = (unsigned int *)kmalloc(sizeof(unsigned int)*16, GFP_KERNEL);
                    ^
/home/ext_first/kernel/net/hello.c:28:20: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
     mem_hello[1] = (unsigned int *)kmalloc(sizeof(unsigned int)*16, GFP_KERNEL);
                    ^
/home/ext_first/kernel/net/hello.c:31:9: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
         kfree(mem_hello[0]);
for this, there are many footprints on the internet about the solution (just try searching the key words and see how many hits Google/Baidu return...), and it is really simple:

#include <linux/slab.h>

(just to complain a little that CSDN does not provide a "C" style code block for reference)
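for instance, here is a minimal reconstruction of the offending hello.c (the mem_hello array and the new_mem function are hypothetical, inferred from the error output above), which compiles cleanly once the include is in place:

#include <linux/slab.h>   /* kmalloc()/kfree() are declared here */
#include <linux/errno.h>  /* for -ENOMEM */

static unsigned int *mem_hello[2];

static int new_mem(void)
{
    /* no cast needed: kmalloc() returns void *. the earlier
     * int-to-pointer-cast errors appeared only because the missing
     * declaration made gcc assume kmalloc() returned int */
    mem_hello[0] = kmalloc(sizeof(unsigned int) * 16, GFP_KERNEL);
    if (!mem_hello[0])
        return -ENOMEM;

    mem_hello[1] = kmalloc(sizeof(unsigned int) * 16, GFP_KERNEL);
    if (!mem_hello[1]) {
        kfree(mem_hello[0]);
        return -ENOMEM;
    }
    return 0;
}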

where is the kmalloc

but if you search the kernel header files, there is more than one "sl<XXX>.h". why is "slab.h" the one chosen? just because the name looks simpler and nicer?

let's search in more detail.

kernel$ ls include/linux/ | grep -e "^sl.b"
slab_def.h
slab.h
slob_def.h
slub_def.h
as the "slab.h" do look simple, just start checking from here firstly.

#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/workqueue.h>
it just includes some common headers: "gfp.h" surely for memory, and "types.h" for type definitions. eh, "workqueue.h"? looks like it really has a chance to do some scheduling work.

go on (all the following code is based on kernel VERSION = 3, PATCHLEVEL = 10, SUBLEVEL = 28, i.e. 3.10.28).

#ifdef CONFIG_SLOB
...
#define KMALLOC_MAX_SIZE (1UL << 30)
#include <linux/slob_def.h>
#else /* CONFIG_SLOB */
...
#ifdef CONFIG_SLAB
...
#endif
#define KMALLOC_MAX_SIZE    (1UL << KMALLOC_SHIFT_MAX)
...
#ifdef CONFIG_SLAB
#include <linux/slab_def.h>
#elif defined(CONFIG_SLUB)
#include <linux/slub_def.h>
#else
#error "Unknown slab allocator"
#endif
looks fine: "slab.h" includes the other "slXb_def.h" headers conditionally based on CONFIG_SLAB/SLUB/SLOB. these CONFIG_SLXBs are set by the kernel config files, and after a "make config" like command, the final "slxb" is locked down.

why are there so many slxbs? three of them here, isn't that a little duplicated?

yeah, it is a good question, but a really huge topic i'm not able to cover here :-(. to make it simple: slab was the first, imported from Solaris, to gain global control over the different types (sizes) of memory requests and to make memory management more transparent and intelligent. the following two just refine it to fit more special cases.

SLAB concentrates on caching and is benchmark friendly, while SLOB tries to be as compact as possible and SLUB focuses on execution time cost. the interesting part is that the older two have now largely been replaced by SLUB as the modern default, which is really an expectable result of modern people's life.

so it seems all clear now.

but wait a minute: only a "kfree()" function declaration is found there, so where is the kmalloc? take it easy: based on the conditional header logic above, you'd guess it should lie in the slab/slob/slub_def headers!

to stay in touch with real life, let's suppose the final slab system is SLUB (which is mostly right for currently running linux systems), and see what "kmalloc" looks like.

//kernel/mm/Makefile

obj-$(CONFIG_SLUB) += slub.o
//kernel/include/linux/slab_def.h

static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
    struct kmem_cache *cachep;
    void *ret;

    if (__builtin_constant_p(size)) {
        int i;

        if (!size)
            return ZERO_SIZE_PTR;

        if (WARN_ON_ONCE(size > KMALLOC_MAX_SIZE))
            return NULL;

        i = kmalloc_index(size);

#ifdef CONFIG_ZONE_DMA
        if (flags & GFP_DMA)
            cachep = kmalloc_dma_caches[i];
        else
#endif
            cachep = kmalloc_caches[i];

        ret = kmem_cache_alloc_trace(cachep, flags, size);

        return ret;
    }
    return __kmalloc(size, flags);
}
despite the familiar "__kmalloc", which looks like the real implementation, it seems to first try to request memory another way under the "__builtin_constant_p" condition.

by checking the gcc manual (or "Google/Baidu"), you'll learn it is used for compile-time optimization to make things run quicker. it sounds like "__kmalloc" may do much the same as the code inside this "if". keep that in mind, and let's go into "__kmalloc" first.

below is a reference from the GCC manual about this built-in function:

— Built-in Function: int __builtin_constant_p (exp)

You can use the built-in function __builtin_constant_p to determine if a value is known to be constant at compile-time and hence that GCC can perform constant-folding on expressions involving that value. The argument of the function is the value to test. The function returns the integer 1 if the argument is known to be a compile-time constant and 0 if it is not known to be a compile-time constant. A return of 0 does not indicate that the value is not a constant, but merely that GCC cannot prove it is a constant with the specified value of the -O option.
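if you want to feel it yourself, here is a tiny userspace sketch (plain gcc, nothing kernel-specific assumed):

#include <stdio.h>

int main(void)
{
    int n = 16;

    /* a literal is always a compile-time constant: prints 1 */
    printf("%d\n", __builtin_constant_p(16));

    /* for a variable the answer depends on optimization: typically
     * 0 at -O0, but gcc may prove it constant at -O2 */
    printf("%d\n", __builtin_constant_p(n));

    return 0;
}

this is exactly why kmalloc can pick the cache index at compile time when the size is a literal like kmalloc(64, GFP_KERNEL), and falls back to the runtime __kmalloc path otherwise.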

from the caller to the cache

//kernel/mm/slub.c

void *__kmalloc(size_t size, gfp_t flags)
{
    struct kmem_cache *s;
    void *ret;

    if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
        return kmalloc_large(size, flags);

    s = kmalloc_slab(size, flags);
    if (unlikely(ZERO_OR_NULL_PTR(s)))
        return s;

    ret = slab_alloc(s, flags, _RET_IP_);

    trace_kmalloc(_RET_IP_, ret, size, s->size, flags);

    return ret;
}
hmm, it doesn't look quite like what we expected, but the flow seems the same: get the target kmem_cache from the size first, then fetch the real memory from that cache. the "trace_xxx" functions just work as their names indicate, tracing the "xxx" function when enabled, so you can ignore the trace mechanism's details here.

let's try to find kmalloc_slab.

strange: it's not in the current "slub.c", not in "linux/slub_def.h" or "linux/slab_def.h", not even in "linux/slab.h", nor in "mm/slab.c"! OK, check the headers included by the current C file: an #include "slab.h" is found, which is also included by "mm/slab.c".

//kernel/mm/slab.h

#ifndef CONFIG_SLOB
...
/* Find the kmalloc slab corresponding for a certain size */
struct kmem_cache *kmalloc_slab(size_t, gfp_t);
#endif
just a declaration, guarded by the no-SLOB config. that's OK, we would never define SLOB since we are SLUB :-D.

//kernel/mm/slab_common.c

struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
{
    int index;

    if (size > KMALLOC_MAX_SIZE) {
        WARN_ON_ONCE(!(flags & __GFP_NOWARN));
        return NULL;
    }

    if (size <= 192) {
        if (!size)
            return ZERO_SIZE_PTR;

        index = size_index[size_index_elem(size)];
    } else
        index = fls(size - 1);

#ifdef CONFIG_ZONE_DMA
    if (unlikely((flags & GFP_DMA)))
        return kmalloc_dma_caches[index];
#endif

    return kmalloc_caches[index];
}
now, doesn't it look just like the cache-query flow in kmalloc's "__builtin_constant_p" branch :P! a nice result.

let's go back to our original kmalloc and check whether "kmem_cache_alloc_trace" behaves the same as the remaining code in __kmalloc.

#ifdef CONFIG_TRACING
void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
    void *ret = slab_alloc(s, gfpflags, _RET_IP_);
    trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
    return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
it looks exactly the same in both slub.c and slab.c! but something seems strange: what if the TRACING config is not enabled? that sounds like a real problem! and here is the answer:

#ifdef CONFIG_TRACING
extern void *kmem_cache_alloc_trace(struct kmem_cache *, gfp_t, size_t);
#else
static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
    return kmem_cache_alloc(cachep, flags);
}
#endif
and the "slab_alloc" and "kmem_cache_alloc" in fact looks as below:

static __always_inline void *slab_alloc(struct kmem_cache *s,
        gfp_t gfpflags, unsigned long addr)
{
    return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
    void *ret = slab_alloc(s, gfpflags, _RET_IP_);
    trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size, s->size, gfpflags);
    return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc);
the "trace_kmem_cache_alloc" not related to the memory allocation work but only a type of trace about memory, so you can ignore directly.

to avoid the misunderstanding that i'm lying, let me explain "_RET_IP_" a little. it is in fact a macro in the linux kernel wrapping another GCC built-in function:

#define _RET_IP_    (unsigned long)__builtin_return_address(0)
it returns the current function's return address (with the parameter as 0) or its caller's (with the parameter as 1). you can also find it as the last parameter of "slab_alloc", typed unsigned long and named "addr". looks even more like tracing now!
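again a small userspace sketch, mirroring the kernel macro (the who_called_me function is made up for illustration):

#include <stdio.h>

/* same shape as the kernel's macro */
#define _RET_IP_ ((unsigned long)__builtin_return_address(0))

static void __attribute__((noinline)) who_called_me(void)
{
    /* prints the address inside main() that we will return to */
    printf("called from %#lx\n", _RET_IP_);
}

int main(void)
{
    who_called_me();
    return 0;
}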
so a happy ending: till now, all our previous thoughts are correct.

not yet. or maybe i should say it just brings up a new start: the following is the cache part.

a cache of generic kind

i'm not going to explain how caches work in general here.

so many kinds of objects use caches; nearly every commonly used subsystem in the linux kernel creates one for its own usage.

and the memory subsystem itself is already huge enough to learn, so here we just focus on how this "kmalloc" kind of cache works.

from all the previous code flow, kmalloc gets its memory from the "kmalloc_caches" array in the end (the DMA caches are a similar kind, so they are ignored from here on). let's have a look at what it is.

//kernel/include/linux/slab.h

extern struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];

#ifdef CONFIG_ZONE_DMA
extern struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1];
#endif

it seems to have a relationship with the exact SLXB implementation. we still take the SLUB type.

we may need to learn some macros here before going ahead:

"KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH" and "KMALLOC_MIN_SIZE/KMALLOC_MAX_SIZE"

mostly, KMALLOC_MIN_SIZE = (1 << KMALLOC_SHIFT_LOW) and KMALLOC_MAX_SIZE = (1 << KMALLOC_SHIFT_HIGH).

their meanings are as the names suggest: the minimum and maximum memory sizes kmalloc can serve.

so based on all the above, you can guess that kmalloc_caches may be a pool of power-of-two sized caches from KMALLOC_MIN_SIZE up to 2^KMALLOC_SHIFT_HIGH (i.e. KMALLOC_MAX_SIZE). we will prove it later.

the first is "KMALLOC_SHIFT_LOW":

//kernel/include/linux/slab.h

#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
#else
#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
#endif
for platforms whose DMA minimum alignment is larger than 8 bytes (64 bits), KMALLOC_MIN_SIZE follows ARCH_DMA_MINALIGN; otherwise it is left for the allocator-specific defaults below, ending up no less than 8 bytes (2^3).

this is the initial definition.

the SLAB config re-defines it as 5 when the DMA-derived value above is not available. details below:

#ifdef CONFIG_SLAB
...
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW   5
#endif
#else
...
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW   3
#endif
#endif

for the SLOB and SLUB configs there is no new definition, so they inherit the default of the "else" branch above directly: 3.

KMALLOC_SHIFT_HIGH is always associated with the page size:

#ifdef CONFIG_SLAB
...
#define KMALLOC_SHIFT_HIGH  ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
                (MAX_ORDER + PAGE_SHIFT - 1) : 25)
#else
...
#define KMALLOC_SHIFT_HIGH  (PAGE_SHIFT + 1)
...
#endif
for the SLAB config it is capped so the largest cache is at most 2^25; for the others, 2 pages at most.
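as a quick sanity check, assuming the common x86 defaults PAGE_SHIFT = 12 (4 KiB pages) and MAX_ORDER = 11, the two definitions evaluate like this:

#include <stdio.h>

#define PAGE_SHIFT 12
#define MAX_ORDER  11

/* SLAB flavour: capped at 25 */
#define SLAB_KMALLOC_SHIFT_HIGH \
    ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? (MAX_ORDER + PAGE_SHIFT - 1) : 25)
/* SLUB/SLOB flavour: two pages */
#define SLUB_KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)

int main(void)
{
    /* SLAB: 2^22 = 4 MiB served directly from the kmalloc caches */
    printf("SLAB max cached size: %lu\n", 1UL << SLAB_KMALLOC_SHIFT_HIGH);
    /* SLUB: 2^13 = 8 KiB; anything bigger takes another path */
    printf("SLUB max cached size: %lu\n", 1UL << SLUB_KMALLOC_SHIFT_HIGH);
    return 0;
}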

//kernel/mm/slab_common.c

struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];
EXPORT_SYMBOL(kmalloc_caches);
...
/*
 * Create the kmalloc array. Some of the regular kmalloc arrays
 * may already have been created because they were needed to
 * enable allocations for slab creation.
 */
void __init create_kmalloc_caches(unsigned long flags)
{
    int i;

    for (i = 8; i < KMALLOC_MIN_SIZE; i += 8) {
......
    for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
        if (!kmalloc_caches[i]) {
            kmalloc_caches[i] = create_kmalloc_cache(NULL,
                            1 << i, flags);
        }

        /*
         * Caches that are not of the two-to-the-power-of size.
         * These have to be created immediately after the
         * earlier power of two caches
         */
        if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
            kmalloc_caches[1] = create_kmalloc_cache(NULL, 96, flags);

        if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
            kmalloc_caches[2] = create_kmalloc_cache(NULL, 192, flags);
    }
the function above initializes the kmalloc_caches array from KMALLOC_SHIFT_LOW to KMALLOC_SHIFT_HIGH, each element standing for a cache of a power-of-two size, except elements 1 and 2, which hold the odd 96 and 192 byte caches, created right after the i == 6 and i == 7 rounds.
recall the kmem_cache lookup flow in kmalloc_slab: "index = size_index[size_index_elem(size)]" (note the precondition that size <= 192): they do match!

and the matching functions are really simple:

static inline int size_index_elem(size_t bytes)
{
    return (bytes - 1) / 8;
}

static s8 size_index[24] = {
    3,  /* 8 */
    4,  /* 16 */
    5,  /* 24 */
    5,  /* 32 */
    6,  /* 40 */
    6,  /* 48 */
    6,  /* 56 */
    6,  /* 64 */
    1,  /* 72 */
    1,  /* 80 */
    1,  /* 88 */
    1,  /* 96 */
    7,  /* 104 */
    7,  /* 112 */
    7,  /* 120 */
    7,  /* 128 */
    2,  /* 136 */
    2,  /* 144 */
    2,  /* 152 */
    2,  /* 160 */
    2,  /* 168 */
    2,  /* 176 */
    2,  /* 184 */
    2   /* 192 */
};
the size_index above shows the initial values; it is in fact adjusted inside "create_kmalloc_caches" before the kmalloc_caches array is created.

you'll also notice there are only 24 elements; we'll explain that later.

void __init create_kmalloc_caches(unsigned long flags)
{
    int i;

    for (i = 8; i < KMALLOC_MIN_SIZE; i += 8) {
        int elem = size_index_elem(i);

        if (elem >= ARRAY_SIZE(size_index))
            break;
        size_index[elem] = KMALLOC_SHIFT_LOW;
    }

    if (KMALLOC_MIN_SIZE >= 64) {
        /*
         * The 96 byte size cache is not used if the alignment
         * is 64 byte.
         */
        for (i = 64 + 8; i <= 96; i += 8)
            size_index[size_index_elem(i)] = 7;
    }

    if (KMALLOC_MIN_SIZE >= 128) {
        /*
         * The 192 byte sized cache is not used if the alignment
         * is 128 byte. Redirect kmalloc to use the 256 byte cache
         * instead.
         */
        for (i = 128 + 8; i <= 192; i += 8)
            size_index[size_index_elem(i)] = 8;
    }

    for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
        if (!kmalloc_caches[i]) {
            kmalloc_caches[i] = create_kmalloc_cache(NULL,
                            1 << i, flags);
        }
the first loop starts from 8 and maps every request size below KMALLOC_MIN_SIZE to KMALLOC_SHIFT_LOW.

for example, if the requested size is 7 and KMALLOC_SHIFT_LOW takes its minimal value 3 (meaning KMALLOC_MIN_SIZE equals 8 bytes), the request hits the first element of size_index, and the target cache would be kmalloc_caches[size_index[0] = 3].
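to make the lookup concrete, here is a small userspace replica of the table walk (assuming KMALLOC_MIN_SIZE = 8, so size_index keeps its initial values):

#include <stdio.h>

static const signed char size_index[24] = {
    3, 4, 5, 5, 6, 6, 6, 6, 1, 1, 1, 1,
    7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2
};

static int size_index_elem(size_t bytes)
{
    return (bytes - 1) / 8;
}

int main(void)
{
    size_t sizes[] = { 7, 30, 96, 100, 192 };

    /* e.g. 100 bytes -> element 12 -> index 7 -> the 128-byte cache;
     * 96 bytes -> element 11 -> index 1 -> the odd 96-byte cache */
    for (int i = 0; i < 5; i++)
        printf("size %3zu -> kmalloc_caches[%d]\n",
               sizes[i], size_index[size_index_elem(sizes[i])]);
    return 0;
}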

you'll also see that the later kmalloc_caches initialization starts from kmalloc_caches[KMALLOC_SHIFT_LOW], after all of size_index has been initialized.

the second and third loops refine size_index according to KMALLOC_MIN_SIZE: the second sets the elements for sizes 72..96 (size_index[8..11]) to 7 (the 2^7 = 128 byte cache) if KMALLOC_MIN_SIZE >= 64, and the third sets the elements for sizes 136..192 (size_index[16..23]) to 8 (the 2^8 = 256 byte cache) if KMALLOC_MIN_SIZE >= 128.

after these loops, the tables end up as follows for a few typical configurations:

if KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH = 3/25, KMALLOC_MIN_SIZE = 8:

the size_index[0..23] = [3,4,5,5,6,6,6,6,1,1,1,1,7,7,7,7,2,2,2,2,2,2,2,2], and

the kmalloc_caches[0..KMALLOC_SHIFT_HIGH] = [nl,96,192,2^3,2^4,...,2^24,2^25]


if KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH = 5/25, KMALLOC_MIN_SIZE = 32:

the size_index[0..23] = [5,5,5,5,6,6,6,6,1,1,1,1,7,7,7,7,2,2,2,2,2,2,2,2], and

the kmalloc_caches[0..KMALLOC_SHIFT_HIGH] = [nl,96,192,nl,nl,2^5,...,2^24,2^25]


if KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH = 3/13, KMALLOC_MIN_SIZE = 8:

the size_index[0..23] = [3,4,5,5,6,6,6,6,1,1,1,1,7,7,7,7,2,2,2,2,2,2,2,2], and

the kmalloc_caches[0..KMALLOC_SHIFT_HIGH] = [nl,96,192,2^3,2^4,...,2^12,2^13]


if KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH = 5/13, KMALLOC_MIN_SIZE = 32:

the size_index[0..23] = [5,5,5,5,6,6,6,6,1,1,1,1,7,7,7,7,2,2,2,2,2,2,2,2], and

the kmalloc_caches[0..KMALLOC_SHIFT_HIGH] = [nl,96,192,nl,nl,2^5,...,2^12,2^13]

("nl" stands for NULL: a slot never created in that configuration)


from the above flow, and checking the header files for KMALLOC_SHIFT_LOW/KMALLOC_SHIFT_HIGH, the caches created here range from 2^3 up to 2^25, while the caches indexable through size_index only cover 2^3 up to about 2^8; so through size_index directly, only 192 bytes at most can be allocated.

recall that the function "kmalloc_slab" does have two branches: a size of no more than 192 goes through size_index directly, while the others go another way.

in fact, there is a comment about exactly this just above the definition of size_index:

/*
 * Conversion table for small slabs sizes / 8 to the index in the
 * kmalloc array. This is necessary for slabs < 192 since we have non power
 * of two cache sizes there. The size of larger slabs can be determined using
 * fls.
 */
static s8 size_index[24] = {
so read the information provided everywhere in the source code; some logic may become much easier to get!

//kernel/mm/slab_common.c

struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
{
    int index;
...
    if (size <= 192) {
        if (!size)
            return ZERO_SIZE_PTR;

        index = size_index[size_index_elem(size)];
    } else
        index = fls(size - 1);

the "fls" function is linux kernel API to get the position of themost significant set bit, namely, if you have an numeric input of 1025=2^10+1, you'd get 10.

so all the power-of-two kmalloc_caches are reachable now.

but why does the index need two kinds of logic, with a lookup table just for the small sizes? as the comment above already hinted, the small range contains the non-power-of-two 96 and 192 byte caches, which fls alone cannot select; the table trades a little space for that flexibility, and the cost is tiny: the s8 array takes only 24 bytes.

and there is still a problem here: the max memory allocatable would be limited to 2^25 at most if everything came from kmalloc_caches!

if you travel the kmalloc code more carefully, you'll find the cache path is actually bounded by "KMALLOC_MAX_CACHE_SIZE", while the overall limit KMALLOC_MAX_SIZE for SLUB is about a page size times 2^(MAX_ORDER - 1), the biggest chunk the buddy allocator can hand out. that is really huge!

and a bigger memory request runs into another path:

void *__kmalloc(size_t size, gfp_t flags)
{
    struct kmem_cache *s;
    void *ret;

    if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
        return kmalloc_large(size, flags);
since unlikely() is used here, the linux kernel assumes there should be no such big eaters most of the time. anyway, let's look into the "large" part.

static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
{
    unsigned int order = get_order(size);
    return kmalloc_order_trace(size, flags, order);
}

#ifdef CONFIG_TRACING
extern void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size);
extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order);
#else
static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
{
    return kmem_cache_alloc(s, gfpflags);
}

static __always_inline void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
    return kmalloc_order(size, flags, order);
}
#endif
here, just ignore the trace functions as before and go straight into kmalloc_order:
static __always_inline void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
{
    void *ret;

    flags |= (__GFP_COMP | __GFP_KMEMCG);
    ret = (void *) __get_free_pages(flags, order);
    kmemleak_alloc(ret, size, 1, flags);
    return ret;
}
a too-big memory request is not served from the preallocated caches but from raw pages directly, and that is all fine. for generic memory usage, there is no point caching many big chunks: such requests are rare, so the cached memory would mostly be wasted, and some hungry subsystems would keep triggering memory reclaim even if their requests don't fail immediately. if big allocations are needed frequently, a private cache should be used, or the big eater should go through the normal page allocation path rather than the cache. a truly huge and frequent memory demand is really the requester's own problem to avoid, as resources are never infinite.
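to see where the boundary sits, here is a sketch of the routing decision with a userspace replica of get_order() (assuming PAGE_SHIFT = 12, so KMALLOC_MAX_CACHE_SIZE = 8192 under SLUB):

#include <stdio.h>

#define PAGE_SHIFT             12
#define PAGE_SIZE              (1UL << PAGE_SHIFT)
#define KMALLOC_MAX_CACHE_SIZE (1UL << (PAGE_SHIFT + 1))

/* smallest order such that (PAGE_SIZE << order) >= size */
static int get_order(unsigned long size)
{
    int order = 0;

    size = (size - 1) >> PAGE_SHIFT;
    while (size) {
        order++;
        size >>= 1;
    }
    return order;
}

int main(void)
{
    unsigned long sizes[] = { 4096, 8192, 8193, 100000 };

    for (int i = 0; i < 4; i++) {
        if (sizes[i] > KMALLOC_MAX_CACHE_SIZE)
            /* e.g. 100000 -> order 5 -> 32 contiguous pages */
            printf("%7lu -> kmalloc_large, order %d\n",
                   sizes[i], get_order(sizes[i]));
        else
            printf("%7lu -> kmalloc_caches\n", sizes[i]);
    }
    return 0;
}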

and one point to keep in mind: our whole flow here is based on the fashionable SLUB kind; the other allocators may not share the same limits or design for the max size. some may still cache the huge part, while others may decide it does not belong here and call some other API like __get_free_pages. it all depends.

so we may get it now: kmalloc maintains a generic set of common memory caches in power-of-two sizes, neither too small nor too big. a request of no more than 192 bytes is indexed through a size-index array mapping directly to the associated cache, while a bigger request's serving cache is calculated from its size. since only normal-sized memory is cached, a huge request is fetched directly from the normal memory zones instead of a cache. additionally, if your code really keeps eating that much memory, i think it is time to review it now.

reference:

http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Other-Builtins.html

http://lxr.free-electrons.com/source/include/linux/kernel.h#L132

http://learning-kernel.readthedocs.org/en/latest/c_skills.html
