2.5. Paging in Linux

Linux adopts a common paging model that fits both 32-bit and 64-bit architectures. As explained in the earlier section "Paging for 64-bit Architectures," two paging levels are sufficient for 32-bit architectures, while 64-bit architectures require a higher number of paging levels. Up to version 2.6.10, the Linux paging model consisted of three paging levels. Starting with version 2.6.11, a four-level paging model has been adopted.^[*] The four types of page tables illustrated in Figure 2-12 are called:

^[*] This change has been made to fully support the linear address bit splitting used by the x86_64 platform (see Table 2-4).

Page Global Directory
Page Upper Directory
Page Middle Directory
Page Table

The Page Global Directory includes the addresses of several Page Upper Directories, which in turn include the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. Thus the linear address can be split into up to five parts. Figure 2-12 does not show the bit numbers, because the size of each part depends on the computer architecture.

For 32-bit architectures with no Physical Address Extension, two paging levels are sufficient. Linux essentially eliminates the Page Upper Directory and the Page Middle Directory fields by saying that they contain zero bits. However, the positions of the Page Upper Directory and the Page Middle Directory in the sequence of pointers are kept so that the same code can work on 32-bit and 64-bit architectures. The kernel keeps a position for the Page Upper Directory and the Page Middle Directory by setting the number of entries in them to 1 and mapping these two entries into the proper entry of the Page Global Directory.

Figure 2-12. The Linux paging model

For 32-bit architectures with the Physical Address Extension enabled, three paging levels are used. Linux's Page Global Directory corresponds to the 80 x 86's Page Directory Pointer Table, the Page Upper Directory is eliminated, the Page Middle Directory corresponds to the 80 x 86's Page Directory, and Linux's Page Table corresponds to the 80 x 86's Page Table.

Finally, for 64-bit architectures three or four levels of paging are used depending on the linear address bit splitting performed by the hardware (see Table 2-4).

Linux's handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:

Assign a different physical address space to each process, ensuring an efficient protection against addressing errors.
Distinguish pages (groups of data) from page frames (physical addresses in main memory). This allows the same page to be stored in a page frame, then saved to disk and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 17).

In the remaining part of this chapter, we will refer for the sake of concreteness to the paging circuitry used by the 80 x 86 processors.

As we will see in Chapter 9, each process has its own Page Global Directory and its own set of Page Tables. When a process switch occurs (see the section "Process Switch" in Chapter 3), Linux saves the cr3 control register in the descriptor of the process previously in execution and then loads cr3 with the value stored in the descriptor of the process to be executed next. Thus, when the new process resumes its execution on the CPU, the paging unit refers to the correct set of Page Tables.

Mapping linear to physical addresses now becomes a mechanical task, although it is still somewhat complex. The next few sections of this chapter are a rather tedious list of functions and macros that retrieve information the kernel needs to find addresses and manage the tables; most of the functions are one or two lines long. You may want to only skim these sections now, but it is useful to know the role of these functions and macros, because you'll see them often in discussions throughout this book.

2.5.1. The Linear Address Fields

The following macros simplify Page Table handling:

PAGE_SHIFT

Specifies the length in bits of the Offset field; when applied to 80 x 86 processors, it yields the value 12. Because all the addresses in a page must fit in the Offset field, the size of a page on 80 x 86 systems is 2¹² or the familiar 4,096 bytes; the PAGE_SHIFT of 12 can thus be considered the logarithm base 2 of the total page size. This macro is used by PAGE_SIZE to return the size of the page. Finally, the PAGE_MASK macro yields the value 0xfffff000 and is used to mask all the bits of the Offset field.

PMD_SHIFT

The total length in bits of the Offset and Table fields of a linear address; in other words, the logarithm of the size of the area a Page Middle Directory entry can map. The PMD_SIZE macro computes the size of the area mapped by a single entry of the Page Middle Directory, that is, of a Page Table. The PMD_MASK macro is used to mask all the bits of the Offset and Table fields.

When PAE is disabled, PMD_SHIFT yields the value 22 (12 from Offset plus 10 from Table), PMD_SIZE yields 2²² or 4 MB, and PMD_MASK yields 0xffc00000. Conversely, when PAE is enabled, PMD_SHIFT yields the value 21 (12 from Offset plus 9 from Table), PMD_SIZE yields 2²¹ or 2 MB, and PMD_MASK yields 0xffe00000.

Large pages do not make use of the last level of page tables; thus LARGE_PAGE_SIZE, which yields the size of a large page, is equal to PMD_SIZE (2^PMD_SHIFT), while LARGE_PAGE_MASK, which is used to mask all the bits of the Offset and Table fields in a large page address, is equal to PMD_MASK.

PUD_SHIFT

Determines the logarithm of the size of the area a Page Upper Directory entry can map. The PUD_SIZE macro computes the size of the area mapped by a single entry of the Page Upper Directory. The PUD_MASK macro is used to mask all the bits of the Offset, Table, and Middle Dir. fields.

On the 80 x 86 processors, PUD_SHIFT is always equal to PMD_SHIFT and PUD_SIZE is equal to 4 MB or 2 MB.

PGDIR_SHIFT

Determines the logarithm of the size of the area that a Page Global Directory entry can map. The PGDIR_SIZE macro computes the size of the area mapped by a single entry of the Page Global Directory. The PGDIR_MASK macro is used to mask all the bits of the Offset, Table, Middle Dir., and Upper Dir. fields.

When PAE is disabled, PGDIR_SHIFT yields the value 22 (the same value yielded by PMD_SHIFT and by PUD_SHIFT), PGDIR_SIZE yields 2²² or 4 MB, and PGDIR_MASK yields 0xffc00000. Conversely, when PAE is enabled, PGDIR_SHIFT yields the value 30 (12 from Offset plus 9 from Table plus 9 from Middle Dir.), PGDIR_SIZE yields 2³⁰ or 1 GB, and PGDIR_MASK yields 0xc0000000.

PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD

Compute the number of entries in the Page Table, Page Middle Directory, Page Upper Directory, and Page Global Directory. They yield the values 1,024, 1, 1, and 1,024, respectively, when PAE is disabled; and the values 512, 512, 1, and 4, respectively, when PAE is enabled.
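
The values quoted for these macros follow mechanically from the widths of the linear address fields. The following is a minimal user-space sketch, not kernel code, that recomputes them for the two 32-bit cases described in the text. The struct paging_layout type and the helper names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Field widths in bits, as given in the text for 80 x 86 processors.
 * This struct is a hypothetical helper, not a kernel data structure. */
struct paging_layout {
    unsigned offset_bits;   /* Offset field */
    unsigned table_bits;    /* Table field */
    unsigned middle_bits;   /* Middle Dir. field (0 when the level is folded) */
};

static const struct paging_layout NO_PAE = { 12, 10, 0 };
static const struct paging_layout PAE    = { 12,  9, 9 };

/* Shift values are sums of the widths of the lower-order fields. */
static unsigned page_shift(const struct paging_layout *l)
{
    return l->offset_bits;
}

static unsigned pmd_shift(const struct paging_layout *l)
{
    return l->offset_bits + l->table_bits;
}

static unsigned pgdir_shift(const struct paging_layout *l)
{
    return pmd_shift(l) + l->middle_bits;
}

/* Size in bytes of the area mapped by one entry at the given shift. */
static uint64_t area_size(unsigned shift)
{
    return (uint64_t)1 << shift;
}

/* 32-bit mask that clears every address bit below the given shift. */
static uint32_t area_mask(unsigned shift)
{
    return (uint32_t)~(area_size(shift) - 1);
}
```

For example, area_mask(pmd_shift(&PAE)) evaluates to 0xffe00000 and area_size(pgdir_shift(&PAE)) to 1 GB, matching the figures given for PMD_MASK and PGDIR_SIZE with PAE enabled.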

2.5.2. Page Table Handling

pte_t, pmd_t, pud_t, and pgd_t describe the format of, respectively, a Page Table, a Page Middle Directory, a Page Upper Directory, and a Page Global Directory entry. They are 64-bit data types when PAE is enabled and 32-bit data types otherwise. pgprot_t is another 64-bit (PAE enabled) or 32-bit (PAE disabled) data type that represents the protection flags associated with a single entry.

Five type-conversion macros, __pte, __pmd, __pud, __pgd, and __pgprot, cast an unsigned integer into the required type. Five other type-conversion macros, pte_val, pmd_val, pud_val, pgd_val, and pgprot_val, perform the reverse casting from one of the five previously mentioned specialized types into an unsigned integer.
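
One way to obtain such distinct types in C is to wrap the raw integer in a single-member structure, so that the compiler rejects accidental mixing of, say, a Page Table entry with a protection value. The sketch below assumes PAE is disabled (32-bit entries) and shows only two of the five pairs; the member names are assumptions made for this illustration and are not taken from the kernel headers.

```c
#include <stdint.h>

/* Single-member wrappers: each type is incompatible with the others,
 * even though both hold a plain 32-bit value (PAE disabled). */
typedef struct { uint32_t pte; } pte_t;
typedef struct { uint32_t pgprot; } pgprot_t;

/* Cast an unsigned integer into the wrapper type... */
#define __pte(x)      ((pte_t) { (x) })
#define __pgprot(x)   ((pgprot_t) { (x) })

/* ...and extract the unsigned integer again. */
#define pte_val(x)    ((x).pte)
#define pgprot_val(x) ((x).pgprot)
```

With these definitions, pte_val(__pte(v)) yields v again, while passing a pgprot_t where a pte_t is expected is a compile-time error.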

The kernel also provides several macros and functions to read or modify page table entries:

pte_none, pmd_none, pud_none, and pgd_none yield the value 1 if the corresponding entry has the value 0; otherwise, they yield the value 0.
pte_clear, pmd_clear, pud_clear, and pgd_clear clear an entry of the corresponding page table, thus forbidding a process to use the linear addresses mapped by the page table entry. The ptep_get_and_clear( ) function clears a Page Table entry and returns the previous value.
set_pte, set_pmd, set_pud, and set_pgd write a given value into a page table entry; set_pte_atomic is identical to set_pte, but when PAE is enabled it also ensures that the 64-bit value is written atomically.
pte_same(a,b) returns 1 if two Page Table entries a and b refer to the same page and specify the same access privileges, 0 otherwise.
pmd_large(e) returns 1 if the Page Middle Directory entry e refers to a large page (2 MB or 4 MB), 0 otherwise.

The pmd_bad macro is used by functions to check Page Middle Directory entries passed as input parameters. It yields the value 1 if the entry points to a bad Page Table, that is, if at least one of the following conditions applies:

The page is not in main memory (Present flag cleared).
The page allows only Read access (Read/Write flag cleared).
Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing Page Table).

The pud_bad and pgd_bad macros always yield 0. No pte_bad macro is defined, because it is legal for a Page Table entry to refer to a page that is not present in main memory, not writable, or not accessible at all.
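
The three conditions checked by pmd_bad can be expressed as a single mask comparison. The sketch below is a simplified illustration that tests only the conditions listed above; the FLAG_* names are hypothetical, although the bit positions are the standard 80 x 86 ones. The real kernel macro may check additional bits.

```c
#include <stdint.h>

/* Standard 80 x 86 page table entry flag bits (names are hypothetical). */
#define FLAG_PRESENT   0x001u  /* bit 0 */
#define FLAG_RW        0x002u  /* bit 1 */
#define FLAG_USER      0x004u  /* bit 2 */
#define FLAG_ACCESSED  0x020u  /* bit 5 */
#define FLAG_DIRTY     0x040u  /* bit 6 */

/* Flags that must all be set in an entry pointing to a valid Page Table. */
#define REQUIRED_TABLE_FLAGS \
    (FLAG_PRESENT | FLAG_RW | FLAG_ACCESSED | FLAG_DIRTY)

/* Returns 1 if at least one required flag is cleared, 0 otherwise. */
static int pmd_entry_is_bad(uint32_t entry)
{
    return (entry & REQUIRED_TABLE_FLAGS) != REQUIRED_TABLE_FLAGS;
}
```

Note that the User/Supervisor flag is deliberately left out of the comparison, since the text does not list it among the conditions.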

The pte_present macro yields the value 1 if either the Present flag or the Page Size flag of a Page Table entry is equal to 1, the value 0 otherwise. Recall that the Page Size flag in Page Table entries has no meaning for the paging unit of the microprocessor; the kernel, however, marks Present equal to 0 and Page Size equal to 1 for the pages present in main memory but without read, write, or execute privileges. In this way, any access to such pages triggers a Page Fault exception because Present is cleared, and the kernel can detect that the fault is not due to a missing page by checking the value of Page Size.
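
The way the kernel overloads the Page Size flag can be sketched as follows. This is an illustrative user-space model, not the kernel's code: the enum and function names are hypothetical, while the two bit positions are the standard 80 x 86 ones.

```c
#include <stdint.h>

#define FLAG_PRESENT    0x001u  /* bit 0 */
#define FLAG_PAGE_SIZE  0x080u  /* bit 7, ignored by the paging unit in Page Table entries */

/* Models pte_present: the page is in RAM if either flag is set. */
static int pte_is_present(uint32_t pte)
{
    return (pte & (FLAG_PRESENT | FLAG_PAGE_SIZE)) != 0;
}

/* The three situations a Page Fault handler has to tell apart. */
enum pte_state {
    PTE_MAPPED,     /* Present set: hardware translates the address */
    PTE_NO_ACCESS,  /* in RAM, but no access rights: every access faults */
    PTE_MISSING     /* not in RAM at all */
};

static enum pte_state classify_pte(uint32_t pte)
{
    if (pte & FLAG_PRESENT)
        return PTE_MAPPED;
    if (pte & FLAG_PAGE_SIZE)
        return PTE_NO_ACCESS;
    return PTE_MISSING;
}
```

An entry with Present cleared and Page Size set still counts as present for the kernel, yet any access to it raises a Page Fault, which is exactly the combination the paragraph above describes.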

The pmd_present macro yields the value 1 if the Present flag of the corresponding entry is equal to 1, that is, if the corresponding page or Page Table is loaded in main memory. The pud_present and pgd_present macros always yield the value 1.

The functions listed in Table 2-5 query the current value of any of the flags included in a Page Table entry; with the exception of pte_file(), these functions work properly only on Page Table entries for which pte_present returns 1.

Table 2-5. Page flag reading functions

Function name     Description
pte_user( )       Reads the User/Supervisor flag
pte_read( )       Reads the User/Supervisor flag (pages on the 80 x 86 processor cannot be protected against reading)
pte_write( )      Reads the Read/Write flag
pte_exec( )       Reads the User/Supervisor flag (pages on the 80 x 86 processor cannot be protected against code execution)
pte_dirty( )      Reads the Dirty flag
pte_young( )      Reads the Accessed flag
pte_file( )       Reads the Dirty flag (when the Present flag is cleared and the Dirty flag is set, the page belongs to a non-linear disk file mapping; see Chapter 16)

Another group of functions listed in Table 2-6 sets the value of the flags in a Page Table entry.

Table 2-6. Page flag setting functions

Function name                   Description
mk_pte_huge( )                  Sets the Page Size and Present flags of a Page Table entry
pte_wrprotect( )                Clears the Read/Write flag
pte_rdprotect( )                Clears the User/Supervisor flag
pte_exprotect( )                Clears the User/Supervisor flag
pte_mkwrite( )                  Sets the Read/Write flag
pte_mkread( )                   Sets the User/Supervisor flag
pte_mkexec( )                   Sets the User/Supervisor flag
pte_mkclean( )                  Clears the Dirty flag
pte_mkdirty( )                  Sets the Dirty flag
pte_mkold( )                    Clears the Accessed flag (makes the page old)
pte_mkyoung( )                  Sets the Accessed flag (makes the page young)
pte_modify(p,v)                 Sets all access rights in a Page Table entry p to a specified value v
ptep_set_wrprotect( )           Like pte_wrprotect( ), but acts on a pointer to a Page Table entry
ptep_set_access_flags( )        If the Dirty flag is set, sets the page's access rights to a specified value and invokes flush_tlb_page( ) (see the section "Translation Lookaside Buffers (TLB)" later in this chapter)
ptep_mkdirty( )                 Like pte_mkdirty( ) but acts on a pointer to a Page Table entry
ptep_test_and_clear_dirty( )    Like pte_mkclean( ) but acts on a pointer to a Page Table entry and returns the old value of the flag
ptep_test_and_clear_young( )    Like pte_mkold( ) but acts on a pointer to a Page Table entry and returns the old value of the flag
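
Most of the flag functions in Tables 2-5 and 2-6 amount to a single bitwise operation on the entry. The sketch below models a few of them on bare 32-bit integers. The function and FLAG_* names are hypothetical stand-ins chosen for this illustration (the kernel functions operate on pte_t values), and the pointer-based variant here is not atomic, which a real implementation would presumably need to be.

```c
#include <stdint.h>

/* Standard 80 x 86 flag bits (names are hypothetical). */
#define FLAG_PRESENT   0x001u
#define FLAG_RW        0x002u
#define FLAG_ACCESSED  0x020u
#define FLAG_DIRTY     0x040u

/* Readers, in the spirit of pte_write( ) and pte_dirty( ). */
static int is_writable(uint32_t pte) { return (pte & FLAG_RW) != 0; }
static int is_dirty(uint32_t pte)    { return (pte & FLAG_DIRTY) != 0; }

/* Setters return the modified entry, in the spirit of pte_wrprotect( ),
 * pte_mkwrite( ), pte_mkdirty( ), pte_mkclean( ), pte_mkold( ), pte_mkyoung( ). */
static uint32_t wrprotect(uint32_t pte) { return pte & ~FLAG_RW; }
static uint32_t mkwrite(uint32_t pte)   { return pte | FLAG_RW; }
static uint32_t mkdirty(uint32_t pte)   { return pte | FLAG_DIRTY; }
static uint32_t mkclean(uint32_t pte)   { return pte & ~FLAG_DIRTY; }
static uint32_t mkold(uint32_t pte)     { return pte & ~FLAG_ACCESSED; }
static uint32_t mkyoung(uint32_t pte)   { return pte | FLAG_ACCESSED; }

/* Pointer variant, in the spirit of ptep_test_and_clear_dirty( ):
 * clears Dirty in place and returns the old value of the flag.
 * NOT atomic in this sketch. */
static int test_and_clear_dirty(uint32_t *ptep)
{
    int old = (*ptep & FLAG_DIRTY) != 0;
    *ptep &= ~FLAG_DIRTY;
    return old;
}
```

The value-returning style means a caller typically writes the result back with something like set_pte, whereas the pointer variants modify the entry where it lives.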

Now, let's discuss the macros listed in Table 2-7 that combine a page address and a group of protection flags into a page table entry or perform the reverse operation of extracting the page address from a page table entry. Notice that some of these macros refer to a page through the linear address of its "page descriptor" (see the section "Page Descriptors" in Chapter 8) rather than the linear address of the page itself.

Table 2-7. Macros acting on Page Table entries

Macro name

Description

pgd_index(addr)

Yields the index (relative position) of the entry in the Page Global Directory that maps the linear address addr.

pgd_offset(mm, addr)

Receives as parameters the address of a memory descriptor mm (see Chapter 9) and a linear address addr. The macro yields the linear address of the entry in a Page Global Directory that corresponds to the address addr; the Page Global Directory is found through a pointer within the memory descriptor.

pgd_offset_k(addr)

Yields the linear address of the entry in the master kernel Page Global Directory that corresponds to the address addr (see the later section "Kernel Page Tables").

pgd_page(pgd)

Yields the page descriptor address of the page frame containing the Page Upper Directory referred to by the Page Global Directory entry pgd. In a two- or three-level paging system, this macro is equivalent to pud_page() applied to the folded Page Upper Directory entry.

pud_offset(pgd, addr)

Receives as parameters a pointer pgd to a Page Global Directory entry and a linear address addr. The macro yields the linear address of the entry in a Page Upper Directory that corresponds to addr. In a two- or three-level paging system, this macro yields pgd, the address of a Page Global Directory entry.

pud_page(pud)

Yields the linear address of the Page Middle Directory referred to by the Page Upper Directory entry pud. In a two-level paging system, this macro is equivalent to pmd_page() applied to the folded Page Middle Directory entry.

pmd_index(addr)

Yields the index (relative position) of the entry in the Page Middle Directory that maps the linear address addr.

pmd_offset(pud, addr)

Receives as parameters a pointer pud to a Page Upper Directory entry and a linear address addr. The macro yields the address of the entry in a Page Middle Directory that corresponds to addr. In a two-level paging system, it yields pud, the address of a Page Global Directory entry.

pmd_page(pmd)

Yields the page descriptor address of the Page Table referred to by the Page Middle Directory entry pmd. In a two-level paging system, pmd is actually an entry of a Page Global Directory.

mk_pte(p,prot)

Receives as parameters the address of a page descriptor p and a group of access rights prot, and builds the corresponding Page Table entry.

pte_index(addr)

Yields the index (relative position) of the entry in the Page Table that maps the linear address addr.

pte_offset_kernel(dir, addr)

Yields the linear address of the Page Table that corresponds to the linear address addr mapped by the Page Middle Directory dir. Used only on the master kernel page tables (see the later section "Kernel Page Tables").

pte_offset_map(dir, addr)

Receives as parameters a pointer dir to a Page Middle Directory entry and a linear address addr; it yields the linear address of the entry in the Page Table that corresponds to the linear address addr. If the Page Table is kept in high memory, the kernel establishes a temporary kernel mapping (see the section "Kernel Mappings of High-Memory Page Frames" in Chapter 8), to be released by means of pte_unmap. The macros pte_offset_map_nested and pte_unmap_nested are identical, but they use a different temporary kernel mapping.

pte_page(x)

Returns the page descriptor address of the page referenced by the Page Table entry x.

pte_to_pgoff(pte)

Extracts from the content pte of a Page Table entry the file offset corresponding to a page belonging to a non-linear file memory mapping (see the section "Non-Linear Memory Mappings" in Chapter 16).

pgoff_to_pte(offset)

Sets up the content of a Page Table entry for a page belonging to a non-linear file memory mapping.
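
The index macros of Table 2-7 each extract one field of a linear address by shifting and masking. The following user-space sketch illustrates the arithmetic for the two 32-bit layouts, using the shift values and entry counts given in the section "The Linear Address Fields". The struct level_layout type and the *_of helper names are hypothetical.

```c
#include <stdint.h>

/* Shift values and table sizes for one paging configuration
 * (hypothetical helper type for this illustration). */
struct level_layout {
    unsigned page_shift, pmd_shift, pgdir_shift;
    unsigned ptrs_per_pte, ptrs_per_pmd, ptrs_per_pgd;
};

static const struct level_layout NO_PAE = { 12, 22, 22, 1024,   1, 1024 };
static const struct level_layout PAE    = { 12, 21, 30,  512, 512,    4 };

/* Index of the Page Global Directory entry that maps addr. */
static unsigned pgd_index_of(const struct level_layout *l, uint32_t addr)
{
    return (addr >> l->pgdir_shift) & (l->ptrs_per_pgd - 1);
}

/* Index of the Page Middle Directory entry that maps addr
 * (always 0 when the level is folded and holds a single entry). */
static unsigned pmd_index_of(const struct level_layout *l, uint32_t addr)
{
    return (addr >> l->pmd_shift) & (l->ptrs_per_pmd - 1);
}

/* Index of the Page Table entry that maps addr. */
static unsigned pte_index_of(const struct level_layout *l, uint32_t addr)
{
    return (addr >> l->page_shift) & (l->ptrs_per_pte - 1);
}
```

For the linear address 0xc0000000, pgd_index_of yields 768 without PAE and 3 with PAE.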

The last group of functions of this long list was introduced to simplify the creation and deletion of page table entries.

When two-level paging is used, creating or deleting a Page Middle Directory entry is trivial. As we explained earlier in this section, the Page Middle Directory contains a single entry that points to the subordinate Page Table. Thus, the Page Middle Directory entry is the entry within the Page Global Directory, too. When dealing with Page Tables, however, creating an entry may be more complex, because the Page Table that is supposed to contain it might not exist. In such cases, it is necessary to allocate a new page frame, fill it with zeros, and add the entry.

If PAE is enabled, the kernel uses three-level paging. When the kernel creates a new Page Global Directory, it also allocates the four corresponding Page Middle Directories; these are freed only when the parent Page Global Directory is released.

When two- or three-level paging is used, the Page Upper Directory entry is always mapped as a single entry within the Page Global Directory.

As usual, the description of the functions listed in Table 2-8 refers to the 80 x 86 architecture.

Table 2-8. Page allocation functions

Function name

Description

pgd_alloc(mm)

Allocates a new Page Global Directory; if PAE is enabled, it also allocates the three children Page Middle Directories that map the User Mode linear addresses. The argument mm (the address of a memory descriptor) is ignored on the 80 x 86 architecture.

pgd_free(pgd)

Releases the Page Global Directory at address pgd; if PAE is enabled, it also releases the three Page Middle Directories that map the User Mode linear addresses.

pud_alloc(mm, pgd, addr)

In a two- or three-level paging system, this function does nothing: it simply returns the linear address of the Page Global Directory entry pgd.

pud_free(x)

In a two- or three-level paging system, this macro does nothing.

pmd_alloc(mm, pud, addr)

Defined so generic three-level paging systems can allocate a new Page Middle Directory for the linear address addr. If PAE is not enabled, the function simply returns the input parameter pud, that is, the address of the entry in the Page Global Directory. If PAE is enabled, the function returns the linear address of the Page Middle Directory entry that maps the linear address addr. The argument mm is ignored.

pmd_free(x)

Does nothing, because Page Middle Directories are allocated and deallocated together with their parent Page Global Directory.

pte_alloc_map(mm, pmd, addr)

Receives as parameters the address of a Page Middle Directory entry pmd and a linear address addr, and returns the address of the Page Table entry corresponding to addr. If the Page Middle Directory entry is null, the function allocates a new Page Table by invoking pte_alloc_one( ). If a new Page Table is allocated, the entry corresponding to addr is initialized and the User/Supervisor flag is set. If the Page Table is kept in high memory, the kernel establishes a temporary kernel mapping (see the section "Kernel Mappings of High-Memory Page Frames" in Chapter 8), to be released by pte_unmap.

pte_alloc_kernel(mm, pmd, addr)

If the Page Middle Directory entry pmd associated with the address addr is null, the function allocates a new Page Table. It then returns the linear address of the Page Table entry associated with addr. Used only for master kernel page tables (see the later section "Kernel Page Tables").

pte_free(pte)

Releases the Page Table associated with the pte page descriptor pointer.

pte_free_kernel(pte)

Equivalent to pte_free( ), but used for master kernel page tables.

clear_page_range(mmu, start, end)

Clears the contents of the page tables of a process from linear address start to end by iteratively releasing its Page Tables and clearing the Page Middle Directory entries.
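
The allocate-on-demand behavior described for pte_alloc_map( ) and pte_alloc_kernel( ) can be modeled in user space with a two-level structure. This is a toy model under stated assumptions, not the kernel's implementation: the directory holds C pointers instead of physical addresses, calloc( ) stands in for allocating and zero-filling a page frame, and all names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 1024u   /* entries per table with PAE disabled */

/* Toy Page Global Directory: each slot points to a Page Table or is NULL. */
struct toy_page_dir {
    uint32_t *tables[ENTRIES];
};

/* Returns a pointer to the Page Table entry for addr, allocating a
 * zero-filled Page Table first if the directory slot is still empty.
 * Returns NULL if the allocation fails. */
static uint32_t *toy_pte_alloc(struct toy_page_dir *dir, uint32_t addr)
{
    unsigned dir_idx   = addr >> 22;
    unsigned table_idx = (addr >> 12) & (ENTRIES - 1);

    if (dir->tables[dir_idx] == NULL) {
        /* calloc() returns zeroed memory, like a cleared page frame */
        dir->tables[dir_idx] = calloc(ENTRIES, sizeof(uint32_t));
        if (dir->tables[dir_idx] == NULL)
            return NULL;
    }
    return &dir->tables[dir_idx][table_idx];
}

/* Releases every Page Table and clears the directory slots. */
static void toy_page_dir_release(struct toy_page_dir *dir)
{
    for (unsigned i = 0; i < ENTRIES; i++) {
        free(dir->tables[i]);
        dir->tables[i] = NULL;
    }
}
```

A second lookup in the same 4 MB region reuses the existing table, so only the first access to each region pays the allocation cost.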

2.5.3. Physical Memory Layout

During the initialization phase the kernel must build a physical addresses map that specifies which physical address ranges are usable by the kernel and which are unavailable (either because they map hardware devices' I/O shared memory or because the corresponding page frames contain BIOS data).

The kernel considers the following page frames as reserved:

Those falling in the unavailable physical address ranges
Those containing the kernel's code and initialized data structures

A page contained in a reserved page frame can never be dynamically assigned or swapped to disk.

As a general rule, the Linux kernel is installed in RAM starting from the physical address 0x00100000, i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured. A typical configuration yields a kernel that can be loaded in less than 3 MB of RAM.

Why isn't the kernel loaded starting with the first available megabyte of RAM? Well, the PC architecture has several peculiarities that must be taken into account. For example:

Page frame 0 is used by BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST); the BIOS of many laptops, moreover, writes data on this page frame even after the system is initialized.
Physical addresses ranging from 0x000a0000 to 0x000fffff are usually reserved to BIOS routines and to map the internal memory of ISA graphics cards. This area is the well-known hole from 640 KB to 1 MB in all IBM-compatible PCs: the physical addresses exist but they are reserved, and the corresponding page frames cannot be used by the operating system.
Additional page frames within the first megabyte may be reserved by specific computer models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.

In the early stage of the boot sequence (see Appendix A), the kernel queries the BIOS and learns the size of the physical memory. In recent computers, the kernel also invokes a BIOS procedure to build a list of physical address ranges and their corresponding memory types.

Later, the kernel executes the machine_specific_memory_setup( ) function, which builds the physical addresses map (see Table 2-9 for an example). Of course, the kernel builds this table on the basis of the BIOS list, if this is available; otherwise the kernel builds the table following the conservative default setup: all page frames with numbers from 0x9f (LOWMEMSIZE( )) to 0x100 (HIGH_MEMORY) are marked as reserved.

Table 2-9. Example of BIOS-provided physical addresses map

Start         End           Type
0x00000000    0x0009ffff    Usable
0x000f0000    0x000fffff    Reserved
0x00100000    0x07feffff    Usable
0x07ff0000    0x07ff2fff    ACPI data
0x07ff3000    0x07ffffff    ACPI NVS
0xffff0000    0xffffffff    Reserved

A typical configuration for a computer having 128 MB of RAM is shown in Table 2-9. The physical address range from 0x07ff0000 to 0x07ff2fff stores information about the hardware devices of the system written by the BIOS in the POST phase; during the initialization phase, the kernel copies such information in a suitable kernel data structure, and then considers these page frames usable. Conversely, the physical address range of 0x07ff3000 to 0x07ffffff is mapped to ROM chips of the hardware devices. The physical address range starting from 0xffff0000 is marked as reserved, because it is mapped by the hardware to the BIOS's ROM chip (see Appendix A). Notice that the BIOS may not provide information for some physical address ranges (in the table, the range is 0x000a0000 to 0x000effff). To be on the safe side, Linux assumes that such ranges are not usable.
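
The rule that unlisted ranges are treated as unusable can be illustrated with a small lookup over the entries of Table 2-9. This is a user-space sketch with hypothetical type and function names; it does not reproduce the kernel's own data structures for the physical addresses map.

```c
#include <stddef.h>
#include <stdint.h>

enum mem_type {
    MEM_USABLE,
    MEM_RESERVED,
    MEM_ACPI_DATA,
    MEM_ACPI_NVS,
    MEM_UNLISTED    /* the BIOS gave no information for this address */
};

struct mem_range {
    uint32_t start, end;    /* inclusive bounds, as in Table 2-9 */
    enum mem_type type;
};

/* The example map of Table 2-9 (a machine with 128 MB of RAM). */
static const struct mem_range bios_map[] = {
    { 0x00000000u, 0x0009ffffu, MEM_USABLE    },
    { 0x000f0000u, 0x000fffffu, MEM_RESERVED  },
    { 0x00100000u, 0x07feffffu, MEM_USABLE    },
    { 0x07ff0000u, 0x07ff2fffu, MEM_ACPI_DATA },
    { 0x07ff3000u, 0x07ffffffu, MEM_ACPI_NVS  },
    { 0xffff0000u, 0xffffffffu, MEM_RESERVED  },
};

/* Returns the type of the range containing phys, or MEM_UNLISTED. */
static enum mem_type lookup_phys(uint32_t phys)
{
    for (size_t i = 0; i < sizeof(bios_map) / sizeof(bios_map[0]); i++)
        if (phys >= bios_map[i].start && phys <= bios_map[i].end)
            return bios_map[i].type;
    return MEM_UNLISTED;
}

/* Conservative policy: only ranges explicitly marked usable qualify. */
static int phys_is_usable(uint32_t phys)
{
    return lookup_phys(phys) == MEM_USABLE;
}
```

With this map, address 0x000a0000 falls in the gap the BIOS did not describe, so phys_is_usable( ) reports it as not usable.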

The kernel might not see all physical memory reported by the BIOS: for instance, the kernel can address only 4 GB of RAM if it has not been compiled with PAE support, even if a larger amount of physical memory is actually available. The setup_memory( ) function is invoked right after machine_specific_memory_setup( ): it analyzes the table of physical memory regions and initializes a few variables that describe the kernel's physical memory layout. These variables are shown in Table 2-10.

Table 2-10. Variables describing the kernel's physical memory layout

Variable name      Description
num_physpages      Page frame number of the highest usable page frame
totalram_pages     Total number of usable page frames
min_low_pfn        Page frame number of the first usable page frame after the kernel image in RAM
max_pfn            Page frame number of the last usable page frame
max_low_pfn        Page frame number of the last page frame directly mapped by the kernel (low memory)
totalhigh_pages    Total number of page frames not directly mapped by the kernel (high memory)
highstart_pfn      Page frame number of the first page frame not directly mapped by the kernel
highend_pfn        Page frame number of the last page frame not directly mapped by the kernel

To avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. Clearly, page frames not reserved by the PC architecture will be used by Linux to store dynamically assigned pages.

Figure 2-13 shows how the first 3 MB of RAM are filled by Linux. We have assumed that the kernel requires less than 3 MB of RAM.

The symbol _text, which corresponds to physical address 0x00100000, denotes the address of the first byte of kernel code. The end of the kernel code is similarly identified by the symbol _etext. Kernel data is divided into two groups: initialized and uninitialized. The initialized data starts right after _etext and ends at _edata. The uninitialized data follows and ends up at _end.

The symbols appearing in the figure are not defined in Linux source code; they are produced while compiling the kernel.^[*]

^[*] You can find the linear address of these symbols in the file System.map, which is created right after the kernel is compiled.

Figure 2-13. The first 768 page frames (3 MB) in Linux 2.6

2.5.4. Process Page Tables

The linear address space of a process is divided into two parts:

Linear addresses from 0x00000000 to 0xbfffffff can be addressed when the process runs in either User or Kernel Mode.
Linear addresses from 0xc0000000 to 0xffffffff can be addressed only when the process runs in Kernel Mode.

When a process runs in User Mode, it issues linear addresses smaller than 0xc0000000; when it runs in Kernel Mode, it is executing kernel code and the linear addresses issued are greater than or equal to 0xc0000000. In some cases, however, the kernel must access the User Mode linear address space to retrieve or store data.

The PAGE_OFFSET macro yields the value 0xc0000000; this is the offset in the linear address space of a process where the kernel lives. In this book, we often refer directly to the number 0xc0000000 instead.

The content of the first entries of the Page Global Directory that map linear addresses lower than 0xc0000000 (the first 768 entries with PAE disabled, or the first 3 entries with PAE enabled) depends on the specific process. Conversely, the remaining entries should be the same for all processes and equal to the corresponding entries of the master kernel Page Global Directory (see the following section).

2.5.5. Kernel Page Tables

The kernel maintains a set of page tables for its own use, rooted at a so-called master kernel Page Global Directory. After system initialization, this set of page tables is never directly used by any process or kernel thread; rather, the highest entries of the master kernel Page Global Directory are the reference model for the corresponding entries of the Page Global Directories of every regular process in the system.

We explain how the kernel ensures that changes to the master kernel Page Global Directory are propagated to the Page Global Directories that are actually used by processes in the section "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8.

We now describe how the kernel initializes its own page tables. This is a two-phase activity. In fact, right after the kernel image is loaded into memory, the CPU is still running in real mode; thus, paging is not enabled.

In the first phase, the kernel creates a limited address space including the kernel's code and data segments, the initial Page Tables, and 128 KB for some dynamic data structures. This minimal address space is just large enough to install the kernel in RAM and to initialize its core data structures.

In the second phase, the kernel takes advantage of all of the existing RAM and sets up the page tables properly. Let us examine how this plan is executed.

2.5.5.1. Provisional kernel Page Tables

A provisional Page Global Directory is initialized statically during kernel compilation, while the provisional Page Tables are initialized by the startup_32( ) assembly language function defined in arch/i386/kernel/head.S. We won't bother mentioning the Page Upper Directories and Page Middle Directories anymore, because they are equated to Page Global Directory entries. PAE support is not enabled at this stage.

The provisional Page Global Directory is contained in the swapper_pg_dir variable. The provisional Page Tables are stored starting from pg0, right after the end of the kernel's uninitialized data segments (symbol _end in Figure 2-13). For the sake of simplicity, let's assume that the kernel's segments, the provisional Page Tables, and the 128 KB memory area fit in the first 8 MB of RAM. In order to map 8 MB of RAM, two Page Tables are required.

The objective of this first phase of paging is to allow these 8 MB of RAM to be easily addressed both in real mode and protected mode. Therefore, the kernel must create a mapping from both the linear addresses 0x00000000 through 0x007fffff and the linear addresses 0xc0000000 through 0xc07fffff into the physical addresses 0x00000000 through 0x007fffff. In other words, the kernel during its first phase of initialization can address the first 8 MB of RAM by either linear addresses identical to the physical ones or 8 MB worth of linear addresses, starting from 0xc0000000.

The kernel creates the desired mapping by filling all the swapper_pg_dir entries with zeroes, except for entries 0, 1, 0x300 (decimal 768), and 0x301 (decimal 769); the latter two entries span all linear addresses between 0xc0000000 and 0xc07fffff. The 0, 1, 0x300, and 0x301 entries are initialized as follows:

The address field of entries 0 and 0x300 is set to the physical address of pg0, while the address field of entries 1 and 0x301 is set to the physical address of the page frame following pg0.
The Present, Read/Write, and User/Supervisor flags are set in all four entries.
The Accessed, Dirty, PCD, PWT, and Page Size flags are cleared in all four entries.
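
The resulting layout of the provisional Page Global Directory can be reproduced with a short user-space simulation. This is only a sketch: the real initialization is done in assembly by startup_32( ), the function name is hypothetical, and the physical address of pg0 is passed in as a parameter because it depends on where the kernel image ends.

```c
#include <stdint.h>
#include <string.h>

#define PTRS_PER_PGD  1024
#define PAGE_SIZE     4096u

/* Standard 80 x 86 flag bits (names are hypothetical). */
#define FLAG_PRESENT  0x001u
#define FLAG_RW       0x002u
#define FLAG_USER     0x004u

/* Fills pgd as described in the text: everything zero except entries
 * 0, 1, 0x300 and 0x301, which point to the two provisional Page Tables
 * starting at physical address pg0_phys. */
static void init_provisional_pgd(uint32_t pgd[PTRS_PER_PGD], uint32_t pg0_phys)
{
    const uint32_t flags = FLAG_PRESENT | FLAG_RW | FLAG_USER;

    memset(pgd, 0, PTRS_PER_PGD * sizeof(pgd[0]));

    /* First Page Table: identity mapping and PAGE_OFFSET mapping of 0-4 MB */
    pgd[0]     = pg0_phys | flags;
    pgd[0x300] = pg0_phys | flags;

    /* Second Page Table, in the page frame following pg0: 4-8 MB */
    pgd[1]     = (pg0_phys + PAGE_SIZE) | flags;
    pgd[0x301] = (pg0_phys + PAGE_SIZE) | flags;
}
```

Because entries 0 and 0x300 share one Page Table and entries 1 and 0x301 share the other, the same 8 MB of RAM are reachable through both linear address ranges.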

The startup_32( ) assembly language function also enables the paging unit. This is achieved by loading the physical address of swapper_pg_dir into the cr3 control register and by setting the PG flag of the cr0 control register, as shown in the following equivalent code fragment:

    movl $swapper_pg_dir-0xc0000000,%eax
    movl %eax,%cr3        /* set the page table pointer.. */
    movl %cr0,%eax
    orl $0x80000000,%eax
    movl %eax,%cr0        /* ..and set paging (PG) bit */

2.5.5.2. Final kernel Page Table when RAM size is less than 896 MB

The final mapping provided by the kernel page tables must transform linear addresses starting from 0xc0000000 into physical addresses starting from 0.

The __pa macro is used to convert a linear address starting from PAGE_OFFSET to the corresponding physical address, while the __va macro does the reverse.
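The conversion is a fixed offset in both directions. The following user-space sketch models it with two plain functions; model_pa and model_va are hypothetical names standing in for the real macros, and the 32-bit PAGE_OFFSET value of 0xc0000000 is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u   /* start of the kernel linear address space */

/* Model of __pa: directly mapped kernel linear address -> physical address. */
uint32_t model_pa(uint32_t linear)
{
    assert(linear >= PAGE_OFFSET);   /* only meaningful above PAGE_OFFSET */
    return linear - PAGE_OFFSET;
}

/* Model of __va: physical address -> kernel linear address. */
uint32_t model_va(uint32_t phys)
{
    assert(phys < 0x40000000u);      /* must fit in the 1 GB kernel window */
    return phys + PAGE_OFFSET;
}
```

For example, model_pa(0xc0100000) yields 0x00100000, and model_va applied to that result returns the original linear address.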

The master kernel Page Global Directory is still stored in swapper_pg_dir. It is initialized by the paging_init( ) function, which does the following:

Invokes pagetable_init( ) to set up the Page Table entries properly.
Writes the physical address of swapper_pg_dir in the cr3 control register.
If the CPU supports PAE and if the kernel is compiled with PAE support, sets the PAE flag in the cr4 control register.
Invokes __flush_tlb_all( ) to invalidate all TLB entries.

The actions performed by pagetable_init( ) depend on both the amount of RAM present and on the CPU model. Let's start with the simplest case. Our computer has less than 896 MB^[*] of RAM, 32-bit physical addresses are sufficient to address all the available RAM, and there is no need to activate the PAE mechanism. (See the earlier section "The Physical Address Extension (PAE) Paging Mechanism.")

^[*] The highest 128 MB of linear addresses are left available for several kinds of mappings (see sections "Fix-Mapped Linear Addresses" later in this chapter and "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8). The kernel address space left for mapping the RAM is thus 1 GB - 128 MB = 896 MB.

The swapper_pg_dir Page Global Directory is reinitialized by a cycle equivalent to the following:

    pgd = swapper_pg_dir + pgd_index(PAGE_OFFSET); /* 768 */
    phys_addr = 0x00000000;
    while (phys_addr < (max_low_pfn * PAGE_SIZE)) {
        pmd = one_md_table_init(pgd); /* returns pgd itself */
        set_pmd(pmd, __pmd(phys_addr | pgprot_val(__pgprot(0x1e3))));
        /* 0x1e3 == Present, Accessed, Dirty, Read/Write,
                    Page Size, Global */
        phys_addr += PTRS_PER_PTE * PAGE_SIZE; /* 0x400000 */
        ++pgd;
    }

We assume that the CPU is a recent 80 x 86 microprocessor supporting 4 MB pages and "global" TLB entries. Notice that the User/Supervisor flags in all Page Global Directory entries referencing linear addresses above 0xc0000000 are cleared, thus denying processes in User Mode access to the kernel address space. Notice also that the Page Size flag is set so that the kernel can address the RAM by making use of large pages (see the section "Extended Paging" earlier in this chapter).
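To see why 0x1e3 leaves the User/Supervisor flag clear, and how many Page Global Directory entries the loop touches, the flag bits can be spelled out in a short user-space sketch. The PDE_* constant names and both helper functions are assumptions introduced here for illustration; the bit positions are those of the 80 x 86 page directory entry.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks of 80x86 page directory entry flags (illustrative names). */
#define PDE_PRESENT   0x001u
#define PDE_RW        0x002u
#define PDE_USER      0x004u
#define PDE_ACCESSED  0x020u
#define PDE_DIRTY     0x040u
#define PDE_PAGE_SIZE 0x080u
#define PDE_GLOBAL    0x100u

#define LARGE_PAGE_SIZE 0x400000u   /* 4 MB, i.e. PTRS_PER_PTE * PAGE_SIZE */

/* Flags used for the kernel's large-page mappings; evaluates to 0x1e3. */
uint32_t kernel_large_page_flags(void)
{
    return PDE_PRESENT | PDE_RW | PDE_ACCESSED | PDE_DIRTY |
           PDE_PAGE_SIZE | PDE_GLOBAL;
}

/* Number of PGD entries needed to map ram_bytes of low memory
 * with 4 MB pages, rounding up to a whole large page. */
unsigned pgd_entries_needed(uint32_t ram_bytes)
{
    assert(ram_bytes > 0);
    return (ram_bytes + LARGE_PAGE_SIZE - 1) / LARGE_PAGE_SIZE;
}
```

With the full 896 MB of low memory, the loop fills 224 entries, starting at index 768 and ending at index 991; the remaining 32 entries (128 MB) stay free for the other kernel mappings.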

The identity mapping of the first megabytes of physical memory (8 MB in our example) built by the startup_32( ) function is required to complete the initialization phase of the kernel. When this mapping is no longer necessary, the kernel clears the corresponding page table entries by invoking the zap_low_mappings( ) function.

Actually, this description does not state the whole truth. As we'll see in the later section "Fix-Mapped Linear Addresses," the kernel also adjusts the entries of Page Tables corresponding to the "fix-mapped linear addresses."

2.5.5.3. Final kernel Page Table when RAM size is between 896 MB and 4096 MB

In this case, the RAM cannot be mapped entirely into the kernel linear address space. The best Linux can do during the initialization phase is to map a RAM window of size 896 MB into the kernel linear address space. If a program needs to address other parts of the existing RAM, some other linear address interval must be mapped to the required RAM. This implies changing the value of some page table entries. We'll discuss how this kind of dynamic remapping is done in Chapter 8.

To initialize the Page Global Directory, the kernel uses the same code as in the previous case.

2.5.5.4. Final kernel Page Table when RAM size is more than 4096 MB

Let's now consider kernel Page Table initialization for computers with more than 4 GB; more precisely, we deal with cases in which the following happens:

The CPU model supports Physical Address Extension (PAE).
The amount of RAM is larger than 4 GB.
The kernel is compiled with PAE support.

Although PAE handles 36-bit physical addresses, linear addresses are still 32-bit addresses. As in the previous case, Linux maps an 896-MB RAM window into the kernel linear address space; the remaining RAM is left unmapped and handled by dynamic remapping, as described in Chapter 8. The main difference from the previous case is that a three-level paging model is used, so the Page Global Directory is initialized by a cycle equivalent to the following:

    pgd_idx = pgd_index(PAGE_OFFSET); /* 3 */
    for (i=0; i<pgd_idx; i++)
        set_pgd(swapper_pg_dir + i, __pgd(__pa(empty_zero_page) + 0x001));
        /* 0x001 == Present */
    pgd = swapper_pg_dir + pgd_idx;
    phys_addr = 0x00000000;
    for (; i<PTRS_PER_PGD; ++i, ++pgd) {
        pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
        set_pgd(pgd, __pgd(__pa(pmd) | 0x001)); /* 0x001 == Present */
        if (phys_addr < max_low_pfn * PAGE_SIZE)
            for (j=0; j < PTRS_PER_PMD /* 512 */
                  && phys_addr < max_low_pfn*PAGE_SIZE; ++j) {
                set_pmd(pmd, __pmd(phys_addr |
                               pgprot_val(__pgprot(0x1e3))));
                /* 0x1e3 == Present, Accessed, Dirty, Read/Write,
                        Page Size, Global */
                phys_addr += PTRS_PER_PTE * PAGE_SIZE; /* 0x200000 */
            }
    }
    swapper_pg_dir[0] = swapper_pg_dir[pgd_idx];

The kernel initializes the first three entries in the Page Global Directory corresponding to the user linear address space with the address of an empty page (empty_zero_page). The fourth entry is initialized with the address of a Page Middle Directory (pmd) allocated by invoking alloc_bootmem_low_pages( ). The first 448 entries in the Page Middle Directory (there are 512 entries, but the last 64 are reserved for noncontiguous memory allocation; see the section "Noncontiguous Memory Area Management" in Chapter 8) are filled with the physical address of the first 896 MB of RAM.

Notice that all CPU models that support PAE also support large 2-MB pages and global pages. As in the previous cases, whenever possible, Linux uses large pages to reduce the number of Page Tables.

The fourth Page Global Directory entry is then copied into the first entry, so as to mirror the mapping of the low physical memory in the first 896 MB of the linear address space. This mapping is required in order to complete the initialization of SMP systems: when it is no longer necessary, the kernel clears the corresponding page table entries by invoking the zap_low_mappings( ) function, as in the previous cases.
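The numbers quoted for the PAE case (index 3, 512 entries, 448 of them used) follow from the way a 32-bit linear address is split into 2 + 9 + 9 + 12 bits. A minimal user-space sketch of that arithmetic is shown below; all names prefixed with pae_ or PAE_ are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAE_PGDIR_SHIFT   30     /* each of the 4 PGD entries spans 1 GB */
#define PAE_PMD_SHIFT     21     /* each PMD entry maps a 2 MB large page */
#define PAE_PTRS_PER_PMD  512

/* Index of the Page Global Directory entry for a linear address. */
unsigned pae_pgd_index(uint32_t linear)
{
    return linear >> PAE_PGDIR_SHIFT;
}

/* Index of the Page Middle Directory entry for a linear address. */
unsigned pae_pmd_index(uint32_t linear)
{
    return (linear >> PAE_PMD_SHIFT) & (PAE_PTRS_PER_PMD - 1);
}

/* Number of PMD entries needed to map low_bytes of RAM with 2 MB pages. */
unsigned pae_pmd_entries_needed(uint64_t low_bytes)
{
    const uint64_t large = 1ull << PAE_PMD_SHIFT;

    /* One Page Middle Directory cannot map more than 1 GB. */
    assert(low_bytes <= (uint64_t)PAE_PTRS_PER_PMD * large);
    return (unsigned)((low_bytes + large - 1) / large);
}
```

Mapping 896 MB therefore needs 448 entries, which leaves 512 - 448 = 64 entries, that is, the top 128 MB of the fourth gigabyte.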

2.5.6. Fix-Mapped Linear Addresses

We saw that the initial part of the fourth gigabyte of kernel linear addresses maps the physical memory of the system. However, at least 128 MB of linear addresses are always left available because the kernel uses them to implement noncontiguous memory allocation and fix-mapped linear addresses.

Noncontiguous memory allocation is just a special way to dynamically allocate and release pages of memory, and is described in the section "Linear Addresses of Noncontiguous Memory Areas" in Chapter 8. In this section, we focus on fix-mapped linear addresses.

Basically, a fix-mapped linear address is a constant linear address like 0xffffc000 whose corresponding physical address does not have to be the linear address minus 0xc0000000, but rather a physical address set in an arbitrary way. Thus, each fix-mapped linear address maps one page frame of the physical memory. As we'll see in later chapters, the kernel uses fix-mapped linear addresses instead of pointer variables that never change their value.

Fix-mapped linear addresses are conceptually similar to the linear addresses that map the first 896 MB of RAM. However, a fix-mapped linear address can map any physical address, while the mapping established by the linear addresses in the initial portion of the fourth gigabyte is linear (linear address X maps physical address X-PAGE_OFFSET).

With respect to variable pointers, fix-mapped linear addresses are more efficient. In fact, dereferencing a variable pointer requires one memory access more than dereferencing an immediate constant address. Moreover, checking the value of a variable pointer before dereferencing it is a good programming practice; conversely, the check is never required for a constant linear address.

Each fix-mapped linear address is represented by a small integer index defined in the enum fixed_addresses data structure:

    enum fixed_addresses {
        FIX_HOLE,
        FIX_VSYSCALL,
        FIX_APIC_BASE,
        FIX_IO_APIC_BASE_0,
        [...]
        __end_of_fixed_addresses
    };

Fix-mapped linear addresses are placed at the end of the fourth gigabyte of linear addresses. The fix_to_virt( ) function computes the constant linear address starting from the index:

    inline unsigned long fix_to_virt(const unsigned int idx)
    {
        if (idx >= __end_of_fixed_addresses)
            __this_fixmap_does_not_exist( );
        return (0xfffff000UL - (idx << PAGE_SHIFT));
    }

Let's assume that some kernel function invokes fix_to_virt(FIX_IO_APIC_BASE_0). Because the function is declared as "inline," the C compiler does not generate a call to fix_to_virt( ), but inserts its code in the calling function. Moreover, the check on the index value is never performed at runtime. In fact, FIX_IO_APIC_BASE_0 is a constant equal to 3, so the compiler can cut away the if statement because its condition is false at compile time. Conversely, if the condition is true or the argument of fix_to_virt( ) is not a constant, the build fails during the linking phase because the symbol __this_fixmap_does_not_exist is not defined anywhere. Eventually, the compiler computes 0xfffff000-(3<<PAGE_SHIFT) and replaces the fix_to_virt( ) function call with the constant linear address 0xffffc000.
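The computation can be reproduced in user space. The sketch below is a simplified model, not the kernel's code: the enum contains only the four indices named in the text, the link-time trick is replaced by a runtime assert, and every identifier ending in _model or _MODEL is an assumption made for this example.

```c
#include <assert.h>

#define PAGE_SHIFT        12
#define FIXADDR_TOP_MODEL 0xfffff000UL   /* last page considered by fix_to_virt */

/* Only the indices named in the text; the real enum has more members. */
enum fixed_addresses_model {
    FIX_HOLE,
    FIX_VSYSCALL,
    FIX_APIC_BASE,
    FIX_IO_APIC_BASE_0,
    END_OF_FIXED_ADDRESSES_MODEL
};

/* Model of fix_to_virt(): index -> constant linear address.
 * Each index moves the address one page further down from the top. */
unsigned long fix_to_virt_model(unsigned int idx)
{
    /* The kernel catches a bad index at link time; this model
     * checks at run time instead. */
    assert(idx < END_OF_FIXED_ADDRESSES_MODEL);
    return FIXADDR_TOP_MODEL - ((unsigned long)idx << PAGE_SHIFT);
}
```

With idx equal to 3, the function subtracts 3 pages (0x3000 bytes) from 0xfffff000 and returns 0xffffc000, the address quoted in the text.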

To associate a physical address with a fix-mapped linear address, the kernel uses the set_fixmap(idx,phys) and set_fixmap_nocache(idx,phys) macros. Both of them initialize the Page Table entry corresponding to the fix_to_virt(idx) linear address with the physical address phys; however, the second macro also sets the PCD flag of the Page Table entry, thus disabling the hardware cache when accessing the data in the page frame (see the section "Hardware Cache" earlier in this chapter). Conversely, clear_fixmap(idx) removes the link between the fix-mapped linear address of index idx and the physical address.

2.5.7. Handling the Hardware Cache and the TLB

The last topic of memory addressing deals with how the kernel makes optimal use of the hardware caches. Hardware caches and Translation Lookaside Buffers play a crucial role in boosting the performance of modern computer architectures. Several techniques are used by kernel developers to reduce the number of cache and TLB misses.

2.5.7.1. Handling the hardware cache

As mentioned earlier in this chapter, hardware caches are addressed by cache lines. The L1_CACHE_BYTES macro yields the size of a cache line in bytes. On Intel models earlier than the Pentium 4, the macro yields the value 32; on a Pentium 4, it yields the value 128.

To optimize the cache hit rate, the kernel considers the architecture in making the following decisions.

The most frequently used fields of a data structure are placed at low offsets within the data structure, so they can be cached in the same line.
When allocating a large set of data structures, the kernel tries to store each of them in memory in such a way that all cache lines are used uniformly.
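The first of these decisions can be illustrated with a user-space sketch. The structure below is invented for the example (it is not a kernel data structure), and the 32-byte line size is the value the text gives for Intel models earlier than the Pentium 4. The helper assumes that the structure itself starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_L1_CACHE_BYTES 32   /* pre-Pentium 4 value quoted in the text */

/* Hypothetical structure: hot fields first, rarely used fields after. */
struct conn_model {
    unsigned long key;          /* hot: read on every lookup */
    unsigned int  state;        /* hot */
    unsigned int  refcount;     /* hot */
    char          name[64];     /* cold */
    unsigned long created_at;   /* cold */
};

/* Cache line, counted from the start of the structure, holding the
 * byte at the given offset. */
size_t cache_line_of(size_t offset)
{
    return offset / MODEL_L1_CACHE_BYTES;
}

/* Returns 1 if the first and the last hot byte fall in the same line. */
int hot_fields_share_line(void)
{
    size_t first_hot = offsetof(struct conn_model, key);
    size_t last_hot  = offsetof(struct conn_model, refcount)
                       + sizeof(unsigned int) - 1;

    assert(first_hot <= last_hot);
    return cache_line_of(first_hot) == cache_line_of(last_hot);
}
```

Moving name[64] in front of the hot fields would push them into a later cache line and could split them across two lines, which is the situation the kernel's layout rule tries to avoid.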

Cache synchronization is performed automatically by the 80 x 86 microprocessors, thus the Linux kernel for this kind of processor does not perform any hardware cache flushing. The kernel does provide, however, cache flushing interfaces for processors that do not synchronize caches.

2.5.7.2. Handling the TLB

Processors cannot synchronize their own TLB cache automatically because it is the kernel, and not the hardware, that decides when a mapping between a linear and a physical address is no longer valid.

Linux 2.6 offers several TLB flush methods that should be applied appropriately, depending on the type of page table change (see Table 2-11).

Table 2-11. Architecture-independent TLB-invalidating methods

| Method name | Description | Typically used when |
|---|---|---|
| flush_tlb_all | Flushes all TLB entries (including those that refer to global pages, that is, pages whose Global flag is set) | Changing the kernel page table entries |
| flush_tlb_kernel_range | Flushes all TLB entries in a given range of linear addresses (including those that refer to global pages) | Changing a range of kernel page table entries |
| flush_tlb | Flushes all TLB entries of the non-global pages owned by the current process | Performing a process switch |
| flush_tlb_mm | Flushes all TLB entries of the non-global pages owned by a given process | Forking a new process |
| flush_tlb_range | Flushes the TLB entries corresponding to a linear address interval of a given process | Releasing a linear address interval of a process |
| flush_tlb_pgtables | Flushes the TLB entries of a given contiguous subset of page tables of a given process | Releasing some page tables of a process |
| flush_tlb_page | Flushes the TLB of a single Page Table entry of a given process | Processing a Page Fault |

Despite the rich set of TLB methods offered by the generic Linux kernel, every microprocessor usually offers a far more restricted set of TLB-invalidating assembly language instructions. In this respect, one of the more flexible hardware platforms is Sun's UltraSPARC. In contrast, Intel microprocessors offer only two TLB-invalidating techniques:

All Pentium models automatically flush the TLB entries relative to non-global pages when a value is loaded into the cr3 register.
In Pentium Pro and later models, the invlpg assembly language instruction invalidates a single TLB entry mapping a given linear address.

Table 2-12 lists the Linux macros that exploit such hardware techniques; these macros are the basic ingredients to implement the architecture-independent methods listed in Table 2-11.

Table 2-12. TLB-invalidating macros for the Intel Pentium Pro and later processors

| Macro name | Description | Used by |
|---|---|---|
| __flush_tlb( ) | Rewrites cr3 register back into itself | flush_tlb, flush_tlb_mm, flush_tlb_range |
| __flush_tlb_global( ) | Disables global pages by clearing the PGE flag of cr4, rewrites cr3 register back into itself, and sets again the PGE flag | flush_tlb_all, flush_tlb_kernel_range |
| __flush_tlb_single(addr) | Executes invlpg assembly language instruction with parameter addr | flush_tlb_page |

Notice that the flush_tlb_pgtables method is missing from Table 2-12: in the 80 x 86 architecture nothing has to be done when a page table is unlinked from its parent table, thus the function implementing this method is empty.
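The difference between the three macros of Table 2-12 is easiest to see on a toy model of a TLB. The sketch below runs in user space and touches no hardware; the structures and the model_ function names are assumptions made for illustration, and the real macros are implemented with privileged instructions instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TLB_ENTRIES 8

struct tlb_entry_model {
    bool     valid;
    bool     global;    /* models the Global flag of the cached entry */
    uint32_t linear;    /* page-aligned linear address */
};

struct tlb_model {
    struct tlb_entry_model e[MODEL_TLB_ENTRIES];
};

/* Models __flush_tlb( ): reloading cr3 drops every non-global entry. */
void model_flush_tlb(struct tlb_model *t)
{
    for (int i = 0; i < MODEL_TLB_ENTRIES; i++)
        if (!t->e[i].global)
            t->e[i].valid = false;
}

/* Models __flush_tlb_global( ): with PGE cleared, global entries go too. */
void model_flush_tlb_global(struct tlb_model *t)
{
    for (int i = 0; i < MODEL_TLB_ENTRIES; i++)
        t->e[i].valid = false;
}

/* Models __flush_tlb_single(addr): invlpg drops the entry of one page. */
void model_flush_tlb_single(struct tlb_model *t, uint32_t addr)
{
    uint32_t page = addr & ~0xfffu;

    for (int i = 0; i < MODEL_TLB_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].linear == page)
            t->e[i].valid = false;
}

/* Counts the entries that are still valid. */
int model_tlb_valid_count(const struct tlb_model *t)
{
    int n = 0;

    assert(t != NULL);
    for (int i = 0; i < MODEL_TLB_ENTRIES; i++)
        n += t->e[i].valid ? 1 : 0;
    return n;
}
```

In this model, a kernel entry marked global survives model_flush_tlb and disappears only after model_flush_tlb_global, which mirrors why flush_tlb_all and flush_tlb_kernel_range rely on the second macro.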

The architecture-independent TLB-invalidating methods are extended quite simply to multiprocessor systems. The function running on a CPU sends an Interprocessor Interrupt (see "Interprocessor Interrupt Handling" in Chapter 4) to the other CPUs that forces them to execute the proper TLB-invalidating function.

As a general rule, any process switch implies changing the set of active page tables. Local TLB entries relative to the old page tables must be flushed; this is done automatically when the kernel writes the address of the new Page Global Directory into the cr3 control register. The kernel succeeds, however, in avoiding TLB flushes in the following cases:

When performing a process switch between two regular processes that use the same set of page tables (see the section "The schedule( ) Function" in Chapter 7).
When performing a process switch between a regular process and a kernel thread. In fact, we'll see in the section "Memory Descriptor of Kernel Threads" in Chapter 9, that kernel threads do not have their own set of page tables; rather, they use the set of page tables owned by the regular process that was scheduled last for execution on the CPU.

Besides process switches, there are other cases in which the kernel needs to flush some entries in a TLB. For instance, when the kernel assigns a page frame to a User Mode process and stores its physical address into a Page Table entry, it must flush any local TLB entry that refers to the corresponding linear address. On multiprocessor systems, the kernel also must flush the same TLB entry on the CPUs that are using the same set of page tables, if any.

To avoid useless TLB flushing in multiprocessor systems, the kernel uses a technique called lazy TLB mode. The basic idea is the following: if several CPUs are using the same page tables and a TLB entry must be flushed on all of them, then TLB flushing may, in some cases, be delayed on CPUs running kernel threads.

In fact, each kernel thread does not have its own set of page tables; rather, it makes use of the set of page tables belonging to a regular process. However, there is no need to invalidate a TLB entry that refers to a User Mode linear address, because no kernel thread accesses the User Mode address space.^[*]

^[*] By the way, the flush_tlb_all method does not use the lazy TLB mode mechanism; it is usually invoked whenever the kernel modifies a Page Table entry relative to the Kernel Mode address space.

When a CPU starts running a kernel thread, the kernel sets it into lazy TLB mode. When requests are issued to clear some TLB entries, each CPU in lazy TLB mode does not flush the corresponding entries; however, the CPU remembers that its current process is running on a set of page tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of page tables, the hardware automatically flushes the TLB entries, and the kernel sets the CPU back in non-lazy TLB mode. However, if a CPU in lazy TLB mode switches to a regular process that owns the same set of page tables used by the previously running kernel thread, then any deferred TLB invalidation must be applied by the kernel. This "lazy" invalidation is achieved by flushing all non-global TLB entries of the CPU.

Some extra data structures are needed to implement the lazy TLB mode. The cpu_tlbstate variable is a static array of NR_CPUS structures (the default value for this macro is 32; it denotes the maximum number of CPUs in the system) consisting of an active_mm field pointing to the memory descriptor of the current process (see Chapter 9) and a state flag that can assume only two values: TLBSTATE_OK (non-lazy TLB mode) or TLBSTATE_LAZY (lazy TLB mode). Furthermore, each memory descriptor includes a cpu_vm_mask field that stores the indices of the CPUs that should receive Interprocessor Interrupts related to TLB flushing. This field is meaningful only when the memory descriptor belongs to a process currently in execution.

When a CPU starts executing a kernel thread, the kernel sets the state field of its cpu_tlbstate element to TLBSTATE_LAZY; moreover, the cpu_vm_mask field of the active memory descriptor stores the indices of all CPUs in the system, including the one that is entering lazy TLB mode. When another CPU wants to invalidate the TLB entries of all CPUs relative to a given set of page tables, it delivers an Interprocessor Interrupt to all CPUs whose indices are included in the cpu_vm_mask field of the corresponding memory descriptor.

When a CPU receives an Interprocessor Interrupt related to TLB flushing and verifies that it affects the set of page tables of its current process, it checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY. In this case, the kernel refuses to invalidate the TLB entries and removes the CPU index from the cpu_vm_mask field of the memory descriptor. This has two consequences:

As long as the CPU remains in lazy TLB mode, it will not receive other Interprocessor Interrupts related to TLB flushing.
If the CPU switches to another process that is using the same set of page tables as the kernel thread that is being replaced, the kernel invokes __flush_tlb( ) to invalidate all non-global TLB entries of the CPU.
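The protocol described above can be summarized as a small state machine. The following user-space sketch models it under simplifying assumptions: a plain bit mask stands in for cpu_vm_mask, a counter stands in for real TLB flushes, and a cleared mask bit is taken as the sign that the CPU refused an Interprocessor Interrupt while lazy. All structure and function names are hypothetical; only TLBSTATE_OK, TLBSTATE_LAZY, active_mm, state, and cpu_vm_mask come from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

enum tlb_state_model { MODEL_TLBSTATE_OK, MODEL_TLBSTATE_LAZY };

struct mm_model {
    unsigned long cpu_vm_mask;   /* bit n set: CPU n receives flush IPIs */
};

struct cpu_tlbstate_model {
    struct mm_model      *active_mm;
    enum tlb_state_model  state;
    int                   flushes;   /* counts simulated local TLB flushes */
};

/* The CPU starts running a kernel thread on top of mm's page tables. */
void model_enter_lazy_tlb(struct cpu_tlbstate_model *cpus, int cpu,
                          struct mm_model *mm)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    cpus[cpu].active_mm = mm;
    cpus[cpu].state = MODEL_TLBSTATE_LAZY;
}

/* Handler of a TLB-flush IPI aimed at mm; returns true if it flushed. */
bool model_flush_ipi(struct cpu_tlbstate_model *cpus, int cpu,
                     struct mm_model *mm)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (cpus[cpu].active_mm != mm)
        return false;                      /* other page tables: ignore */
    if (cpus[cpu].state == MODEL_TLBSTATE_LAZY) {
        mm->cpu_vm_mask &= ~(1UL << cpu);  /* refuse, and opt out of IPIs */
        return false;
    }
    cpus[cpu].flushes++;
    return true;
}

/* The CPU leaves the kernel thread and resumes a regular process using mm. */
void model_switch_to_process(struct cpu_tlbstate_model *cpus, int cpu,
                             struct mm_model *mm)
{
    bool same_tables = (cpus[cpu].active_mm == mm);
    bool missed_ipi  = !(mm->cpu_vm_mask & (1UL << cpu));

    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    /* Different tables: the cr3 reload flushes. Same tables with a
     * refused IPI: the deferred flush, i.e. __flush_tlb( ), happens now. */
    if (!same_tables || missed_ipi)
        cpus[cpu].flushes++;
    cpus[cpu].active_mm = mm;
    cpus[cpu].state = MODEL_TLBSTATE_OK;
    mm->cpu_vm_mask |= 1UL << cpu;
}
```

In this model a lazy CPU pays for at most one flush, when it returns to a regular process, no matter how many invalidation requests arrived while it was running the kernel thread.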