e820 -- retrieve memory map from BIOS

来源:互联网 发布:vb的注释符号 编辑:程序博客网 时间:2024/05/27 12:21

e820 – retrieve memory map from BIOS

What is e820

Definition from Wikipedia:

e820 is shorthand to refer to the facility by which the BIOS of x86-based computer systems reports the memory map to the operating system or boot loader.

It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. It reports which memory address ranges are usable and which are reserved for use by the BIOS.

Let me summarize it:

e820 is a kind of hardware to detects physical memory arrangement
and
software can talk with it by int 15th

This means on x86 architecture, any system what to know the memory layout, it
needs to talk to e820. Different architectures use different mechanism to
detect memory layout. For example, on ppc64, it uses device tree.

e820 Table looks like this

If you are using linux, you could type following command in your terminal:

dmesg | grep e820

Then you will see:

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved

Yes, it is simple. It is just a table. And each entry represents a memory
range with start, end and type. As you can image, the data structure of the
entry is self explain.

struct e820entry {    __u64 addr; /* start of memory segment */    __u64 size; /* size of memory segment */    __u32 type; /* type of memory segment */} __attribute__((packed));

How linux kernel utilize this

Take a look at the kernel code to see how it works~

First, retrieve the e820 table from BIOS

The function access e820 and retrieves the memory layout is
detect_memory_e820().

static int detect_memory_e820(void){    int count = 0;    struct biosregs ireg, oreg;    struct e820entry *desc = boot_params.e820_map;    static struct e820entry buf; /* static so it is zeroed */    initregs(&ireg);    ireg.ax  = 0xe820;    ireg.cx  = sizeof buf;    ireg.edx = SMAP;    ireg.di  = (size_t)&buf;    /*     * Note: at least one BIOS is known which assumes that the     * buffer pointed to by one e820 call is the same one as     * the previous call, and only changes modified fields.  Therefore,     * we use a temporary buffer and copy the results entry by entry.     *     * This routine deliberately does not try to account for     * ACPI 3+ extended attributes.  This is because there are     * BIOSes in the field which report zero for the valid bit for     * all ranges, and we don't currently make any use of the     * other attribute bits.  Revisit this if we see the extended     * attribute bits deployed in a meaningful way in the future.     */    do {        intcall(0x15, &ireg, &oreg);        ireg.ebx = oreg.ebx; /* for next iteration... */        /* BIOSes which terminate the chain with CF = 1 as opposed           to %ebx = 0 don't always report the SMAP signature on           the final, failing, probe. */        if (oreg.eflags & X86_EFLAGS_CF)            break;        /* Some BIOSes stop returning SMAP in the middle of           the search loop.  We don't know exactly how the BIOS           screwed up the map at that point, we might have a           partial map, the full map, or complete garbage, so           just return failure. */        if (oreg.eax != SMAP) {            count = 0;            break;        }        *desc++ = buf;        count++;    } while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_map));    return boot_params.e820_entries = count;}

The logic is also simple, it iterates on the BIOS table with int 0x15 and
stores the information to boo_params.e820_map.

Second, store and sort e820 table to kernel area

As we can see from the previous code, the memory layout is stored in
boot_param, which is not a safe place. And the raw table is not sorted. So
kernel copy the range to a safe place and sorts it.

This is done is function default_machine_specific_memory_setup().

char *__init default_machine_specific_memory_setup(void){    char *who = "BIOS-e820";    u32 new_nr;    /*     * Try to copy the BIOS-supplied E820-map.     *     * Otherwise fake a memory map; one section from 0k->640k,     * the next section from 1mb->appropriate_mem_k     */    new_nr = boot_params.e820_entries;    sanitize_e820_map(boot_params.e820_map,            ARRAY_SIZE(boot_params.e820_map),            &new_nr);    boot_params.e820_entries = new_nr;    if (append_e820_map(boot_params.e820_map, boot_params.e820_entries)      < 0) {        u64 mem_size;        /* compare results from other methods and take the greater */        if (boot_params.alt_mem_k            < boot_params.screen_info.ext_mem_k) {            mem_size = boot_params.screen_info.ext_mem_k;            who = "BIOS-88";        } else {            mem_size = boot_params.alt_mem_k;            who = "BIOS-e801";        }        e820.nr_map = 0;        e820_add_region(0, LOWMEMSIZE(), E820_RAM);        e820_add_region(HIGH_MEMORY, mem_size << 10, E820_RAM);    }    /* In case someone cares... */    return who;}

The details are hidden in the following two functions:
1. sanitize_e820_map(), sort e820 table
2. append_e820_map(), copy e820 table in boot_param to safe place

Take a big picture

It is time to see at which stage linux kernel accomplish the upper two tasks.

    /* Retrieve the e820 table from BIOS */    main(), arch/x86/boot/main.c, called from arch/x86/boot/header.S          detect_memory()               detect_memory_e820(), save bios info to boot_params.e820_entries    /* store and sort e820 table to kernel area */    start_kernel()          setup_arch()               setup_memory_map(),                      default_machine_specific_memory_setup(), store and sort e820 table                     e820_print_map()

The final function e820_print_map() is the one who print the message we
grabbed by “dmesg | grep e820”. Well, that’s it~

Some important APIs

I would like to talk more about some APIs of e820, which make e820 an
infrastructure to other part of the kernel. All these functions are defined in
“arch/x86/kernel/e820.c”.

e820_add_region()/e820_remove_range()

These two functions are a pair to manipulate the e820 map. And importantly
they don’t care about the order of the e820 map.

__e820_update_range()

This one update the “type” of a memory range.

For example, from E820_RAM to E820_RESERVED.

e820_search_gap()

This one looks for a “gap” in physical memory, which will be used in device
MMIO.

memblock_x86_fill()

This function fill the memblock layer with e820 input. This is a very
important function which builds a generic memory management layer at the very
beginning of linux kernel boots.

We will touch the memblock in another thread.

Some interesting algorithms

sanitize_e820_map()

This is a function which sorts or sanitize the original e820 map from BIOS.
And, it is interesting.

Use two change_member to represent one e820entry

e820entry                         change_member+-------------------+             +------------------------+|addr, size         |             |start, e820entry        ||                   |     ===>    +------------------------+|type               |             |end,   e820entry        |+-------------------+             +------------------------+

Whole e820entry creates an change_member array and its pointer array

 change_member*[]                               change_member[] +------------------------+                     +------------------------+ |*start1                 |      ------>        |start1, entry1          | +------------------------+                     +------------------------+ |*end1                   |      ------>        |end1,   entry1          | +------------------------+                     +------------------------+ |*start2                 |      ------>        |start2, entry2          | +------------------------+                     +------------------------+ |*end2                   |      ------>        |end2,   entry2          | +------------------------+                     +------------------------+ |*start3                 |      ------>        |start3, entry3          | +------------------------+                     +------------------------+ |*end3                   |      ------>        |end3,   entry3          | +------------------------+                     +------------------------+ |*start4                 |      ------>        |start4, entry4          | +------------------------+                     +------------------------+ |*end4                   |      ------>        |end4,   entry4          | +------------------------+                     +------------------------+

Sort change_member based on the change_member->addr

 change_member*[]                 +------------------------+       |*start1                 |       +------------------------+       |*start2                 |       +------------------------+       |*end2                   |       +------------------------+       |*end1                   |       +------------------------+       |*start3                 |       +------------------------+       |*start4                 |       +------------------------+       |*end4                   |       +------------------------+       |*end3                   |       +------------------------+      

Loop on the Sorted change_member* array

After sorting the change_member*[], when we iterate on it, we get an ordered
value for the whole e820 map.

Let me try to explain this in following format.

      (start1,   start2,   end2,   end1,   start3,   start4,   end4,   end3)

An overlap bucket to track “type”

The magic of the algorithm lies in the tracking change of the “type”. Let’s
take the above sorted change_member*[] as an example. The overlap_list[] would
change like this during the iteration on change_member*[]:

* chgidx == 0     change_member*[]                      overlap_list[]     +------------------------+            +-------------------+ --> |*start1                 |            |*start1            |     +------------------------+            +-------------------+     |*start2                 |           +------------------------+           |*end2                   |           +------------------------+           |*end1                   |           +------------------------+           |*start3                 |           +------------------------+           |*start4                 |           +------------------------+           |*end4                   |           +------------------------+           |*end3                   |           +------------------------+      * chgidx == 1     change_member*[]                      overlap_list[]     +------------------------+            +-------------------+     |*start1                 |            |*start1            |     +------------------------+            +-------------------+ --> |*start2                 |            |*start2            |     +------------------------+            +-------------------+     |*end2                   |           +------------------------+           |*end1                   |           +------------------------+           |*start3                 |           +------------------------+           |*start4                 |           +------------------------+           |*end4                   |           +------------------------+           |*end3                   |           +------------------------+      * chgidx == 2     change_member*[]                      overlap_list[]     +------------------------+            +-------------------+     |*start1                 |            |*start1            |     +------------------------+            +-------------------+     |*start2                 |            |                   |     +------------------------+            +-------------------+ --> |*end2                   |           +------------------------+           |*end1                   |           +------------------------+           |*start3                 |           +------------------------+           |*start4                 |           +------------------------+           |*end4                   |           +------------------------+           |*end3                   |           +------------------------+      

Do you get some idea on how the overlap_list[] works? This is a simple idea,
only two cases for overlap_list[].

On each “start”, it adds it to overlap_list[].
On each “end”, it removes its “start” from overlap_list[].

The “current_type” is the max value in overlap_list[]. And “last_type” is the
“current_type” in last loop. When they are different, it means we face a
boundary in the e820 map. Then, it is time to add a new entry or set the
proper size for the previous one.

Let me try to explain it with this agian.

0     type1      type2     type1   0       type3     type4     type4   type3        |          |         |      |        |         |        |       |        v          v         v      v        v         v        v       v      (start1,   start2,   end2,   end1,   start3,   start4,   end4,   end3)

e820_search_gap()

This function is used to find the biggest gap in e820 map. On x86, the gap is
returned to be used as MMIO region.

The logic is simple.

i = nr_map - 1;                                                             end           last                                                              |<--- gap  --->|      (start1, end2)         (start2, end2)         (start3, end3)i = nr_map - 2;                                      end           last                                       |<--- gap  --->|      (start1, end2)         (start2, end2)         (start3, end3)i = nr_map - 3;                end           last                 |<--- gap  --->|      (start1, end2)         (start2, end2)         (start3, end3)

Each iteration would calculate the “gap” between two memblock region. The
biggest and suitable one would be returned.

By the way, till v4.9, there is a bug in this function. I have pointed it out
in following patch, while Yinghai think this function just be used in one
place and the start_addr and end_addr hard coded, just remove them.

> From: Wei Yang <richard.weiyang@gmail.com>> Date: Sun, 18 Dec 2016 22:16:52 +0800> Subject: [PATCH] x86/e820: fix e820_search_gap() in case the start_addr is a>  gap> > For example, we have a e820 map like this:> >     e820: [mem 000000000000000000-0x000000000000000f] usable>     e820: [mem 0x00000000e0000000-0x00000000e000000f] usable> > The biggest gap should be:> >     [mem 0x00000010-0xdfffffff]> > While if start_addr is passed by a value bigger than 0x10, current version> will tell following range is the biggest gap:> >     [mem 0xe0000010-0xffffffff]> > The patch fixes this by set end to start_addr, when start_addr is bigger> than end.> > Signed-off-by: Wei Yang <richard.weiyang@gmail.com>> --->  arch/x86/kernel/e820.c | 2 +->  1 file changed, 1 insertion(+), 1 deletion(-)> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c> index f4fb197..1826591 100644> --- a/arch/x86/kernel/e820.c> +++ b/arch/x86/kernel/e820.c> @@ -602,7 +602,7 @@ __init int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,>       unsigned long long end = start + e820->map[i].size;>  >       if (end < start_addr)> -         continue;> +         end = start_addr;>  >       /*>        * Since "last" is at most 4GB, we know we'll> -- > 2.5.0

So the final patch looks like this make e820_search_gap() static.

0 0
原创粉丝点击