
Memory Management: Paging

Paul Krzyzanowski

October 20, 2010 [updated March 23, 2011]

The advantage of a bad memory is that one enjoys several times the same good things for the first time.
— Friedrich Nietzsche

Introduction/Review

This is a continuation of our discussion on memory management.

A memory management unit that supports paging causes every logical address (virtual address) to be translated to a physical address (real address) by translating the logical page number of the address to a physical page frame number. The page number comprises the high bits of the address. The low bits of the address form the offset within the page (or page frame). For example, with a 64-bit address, if the page size is 1 MB, then the lowest 20 bits (which address 1 MB) form the offset and the top 44 bits form the page number.
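
To make the arithmetic concrete, here is a minimal C sketch of that split (the frame number is a made-up value; a real MMU would get it from the page table):

    #include <stdint.h>
    #include <stdio.h>

    /* 1 MB pages: the low 20 bits are the offset; the rest is the page number. */
    #define PAGE_SHIFT  20
    #define OFFSET_MASK ((1ULL << PAGE_SHIFT) - 1)

    int main(void) {
        uint64_t vaddr = 0x00007f8a12345678ULL;      /* arbitrary example address */
        uint64_t page  = vaddr >> PAGE_SHIFT;        /* top 44 bits: page number */
        uint64_t off   = vaddr & OFFSET_MASK;        /* low 20 bits: offset */

        /* Translation replaces the page number with a page frame number and
           carries the offset over unchanged. The frame value is made up. */
        uint64_t frame = 0x3c2;
        uint64_t paddr = (frame << PAGE_SHIFT) | off;

        printf("page %llx offset %llx -> physical %llx\n",
               (unsigned long long)page, (unsigned long long)off,
               (unsigned long long)paddr);
        return 0;
    }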

Thanks to a memory management unit, every process can have its own address space. The same virtual address may reside in two different page frames for two different processes because each process has its own page table. The operating system is responsible for managing the page table for each process. During a context switch, the operating system has to inform the processor's memory management unit that it has to use a different page table. It does this by changing the page table base register, a register that contains the starting address of the page table.

Each page table entry, in addition to storing the corresponding page frame, may also store a variety of flags that indicate if the page is valid (is there a corresponding page frame), how the page may be accessed (read-only, execute, kernel-mode only), and whether the page has been accessed or modified.

Two optimizations are typically present in memory management units: one optimizes lookup time and the other optimizes the space used by the page table.

  1. If every address translation required a lookup in a page table, then our memory access performance would be twice as slow, since we'd need to read the page table in addition to accessing the memory location that we want. This overhead is reduced by a translation lookaside buffer, or TLB, which caches frequently-used page table entries in its associative memory.

  2. Because most processes use only a small fraction of the available virtual address space, there will be large regions of a page table that will contain no entries. To keep the size of a page table more manageable, a multi-level structure is often used. The page bits of a virtual address are split into two parts: the high bits define the offset into a top-level index table. This index table contains base registers for partial page tables. The low bits of the page bits define the offset within that partial page table (a sketch of this two-level lookup follows the list).
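
Here is a rough C sketch of such a two-level lookup, assuming an illustrative 32-bit layout with 10-bit indexes at each level and 4 KB pages; the structures are invented for this sketch and match no particular MMU's descriptor format:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { bool valid; uint32_t frame; } pte_t;
    typedef struct { bool valid; pte_t *table; } pde_t;  /* -> partial page table */

    bool translate(const pde_t *top_level, uint32_t vaddr, uint32_t *paddr) {
        uint32_t top    = (vaddr >> 22) & 0x3ff;  /* index into top-level table */
        uint32_t second = (vaddr >> 12) & 0x3ff;  /* index into partial table */
        uint32_t offset =  vaddr        & 0xfff;

        if (!top_level[top].valid)
            return false;                 /* no partial table here: space saved */
        pte_t *pt = top_level[top].table;
        if (!pt[second].valid)
            return false;                 /* page not mapped: page fault */

        *paddr = (pt[second].frame << 12) | offset;
        return true;
    }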

Logical and physical addressing modes

Most processors support operation in two modes: virtual addressing mode, which uses address translation through the memory management unit, and physical addressing mode, which bypasses the MMU. In this latter mode, every memory reference is a reference to the actual physical memory location in the memory system. Upon power-on, a CPU will start up in physical addressing mode since page tables have not been set up yet. The system will usually remain in physical addressing mode while the operating system is loaded and run to initialize the page table. The operating system will then switch the processor to virtual addressing mode.

System memory map

In today's operating systems, the view of memory is split into two logical regions: kernel memory and process memory. Kernel memory is the memory that is reserved for the operating system kernel's code, data, and stack. Its location in the virtual address space remains constant, typically in the top part of the address space. On 32-bit Linux systems, kernel memory is configurable (PAGE_OFFSET) and is generally set to the top 1 GB of the address space. On Microsoft Windows 7 systems, the top 2 GB of the address space is dedicated to kernel memory on 32-bit systems (8 TB on 64-bit systems).

Process memory represents the remainder of the address space and is available to user processes. In physical memory, page frames will be allocated to various processes. In virtual memory, however, we are looking at memory from the viewpoint of the context of the process. The view of memory is defined by the page table for the process and contains mappings to the pages relevant to that specific process as well as mappings to the kernel.

When a mode switch occurs, either because of an interrupt or a system call, the current page table does not change: we experience a mode switch and not a context switch. Execution is transferred to the kernel, which can, if it needs to, access memory in the context of the process that generated the mode switch. A page table will generally be set up to disallow user-mode access to regions of the address space that are mapped to kernel memory. Most memory management units allow one to set kernel-only access permissions on a page.

Page allocator and kernel memory management

Under virtual memory, processes can freely use discontiguous pages as they need more and more memory. The page tables in the memory management unit (MMU) can provide the illusion of contiguity by making the virtual address space contiguous. The kernel, on the other hand, may need physically contiguous memory buffers.

The Linux operating system maintains an array called free_area of lists of free pages. The first element of the array is a list of free single-page regions; the second element is a list of free two-page regions (free adjacent pages); the third element is a list of free four-page regions; etc.

The buddy algorithm is a memory allocation and management algorithm that manages memory in power-of-two increments. This brief discussion presents only the basic buddy algorithm; there are a number of variations on it.

A memory manager using the Buddy System keeps lists of free blocks that are sized in powers of 2 (2, 4, 8, 16, 32, …). Initially, when all of memory is free, all lists are empty except for the largest power of two that is less than or equal to the size of allocatable memory. When a block of size n is needed, the algorithm checks the list for the nearest power of two that is greater than or equal to n. If there is one, all is well and the block can be marked as used. If there is no free block, the algorithm gets a block from the next level, splits it into two buddies (which get recorded in the previous level of the list), and uses one of those for allocation. When the block is freed again, the buddies can be combined and brought up to the next level in the list. If the next level in the list does not have an available block, the process is repeated with successively bigger-sized blocks in the list.

For example, suppose we need a 42K byte segment of memory. The closest bigger power of two is 64K, so we request a 64K byte segment. Suppose that all we have is one free 1M byte segment. The algorithm looks for a 64K byte segment. It's not there, so it attempts to get a 128K byte segment to split into two 64K buddies. That doesn't exist either, so a 256K byte segment is needed. That, too, does not exist, so we split the 1M byte segment into two 512K byte buddies. One of these segments is split into two 256K buddies, one of those is split into two 128K buddies, and one of those is split into two 64K buddies. This algorithm is fast and makes merging very easy. Unfortunately, because it requires all allocation units to be powers of two, it is wasteful of memory. Unused memory within an allocation unit is known as internal fragmentation. Unused memory between allocation units is known as external fragmentation.
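
Below is a minimal C sketch of the allocation (splitting) half of a buddy allocator in the style of Linux's free_area array; freeing and buddy coalescing are omitted, and all names and sizes are illustrative:

    #include <stddef.h>

    #define MIN_ORDER 12   /* smallest block: 4 KB (2^12 bytes) */
    #define MAX_ORDER 20   /* largest block:  1 MB (2^20 bytes) */

    /* One free list per power-of-two size, like Linux's free_area array.
       A free block's first bytes are reused as its list link. */
    struct block { struct block *next; };
    static struct block *free_area[MAX_ORDER - MIN_ORDER + 1];

    /* Round a request up to the nearest power-of-two order:
       order_for(42 * 1024) == 16, i.e., a 64K block. */
    static int order_for(size_t size) {
        int order = MIN_ORDER;
        while (((size_t)1 << order) < size)
            order++;
        return order;
    }

    static void *alloc_block(int order) {
        int i = order;
        while (i <= MAX_ORDER && free_area[i - MIN_ORDER] == NULL)
            i++;                          /* find the smallest free block that fits */
        if (i > MAX_ORDER)
            return NULL;                  /* nothing big enough is free */

        struct block *b = free_area[i - MIN_ORDER];
        free_area[i - MIN_ORDER] = b->next;

        while (i > order) {               /* split: return one buddy to the list below */
            i--;
            struct block *buddy = (struct block *)((char *)b + ((size_t)1 << i));
            buddy->next = free_area[i - MIN_ORDER];
            free_area[i - MIN_ORDER] = buddy;
        }
        return b;
    }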

Case studies

We will examine the memory management units available on two of today's most popular processor architectures: the ARMv7-A and the Intel 32-bit and 64-bit architectures. Together, these represent the processors used on the vast bulk of devices in the world, ranging from cell phones to supercomputers. These studies will not go into the details of the architecture but rather focus on the basic approaches taken and on some of the unique characteristics.

ARMv7-A MMU Architecture

ARM is a 32-bit reduced instruction set computer (RISC) architecture developed and owned by ARM Holdings. The processors are licensed to manufacturers who may augment them with custom DSPs (digital signal processors), radios, or graphics processors. The processors are used in most cell phones and gaming consoles. The ARMv7-A architecture that we'll be examining is present in the Cortex-A8 and Cortex-A9 processors. The Cortex-A8 is used in Motorola Droid phones, the iPad, and the iPhone (3GS and 4), among numerous other devices. The Cortex-A9 is used in Apple's A5 processor that powers the iPad 2. Recent previous generations of ARM processors are not significantly different in terms of their MMU architecture, so much of this discussion applies to them as well. References to the ARM processor, MMU, or ARM architecture in the rest of this section will refer to the ARMv7-A architecture.

Figure 1. ARM sections and pages

Sections and pages

The ARM MMU supports four page sizes. The largest sizes are called sections and the smaller sizes are called pages:

  • Supersections: 16 MB memory blocks (24-bit offsets)
  • Sections: 1 MB memory blocks (20-bit offsets)
  • Large pages: 64 KB pages (16-bit offsets)
  • Small pages: 4 KB pages (12-bit offsets)

The MMU supports a two-level hierarchy for its page table structure. An entry in the first-level table contains either a pointer to a second-level table (a partial page table) or the base address of a section or a supersection. Hence, if we use the really big pages — sections and supersections — then we don't have to go through two levels of hierarchy. The benefit of sections and supersections is that a large region of memory, such as the operating system, can be mapped using just a single entry in the TLB. The MMU can be configured to use either small or large pages as well as sections or supersections. Sections (or supersections) can be mixed together with pages. Just because the architecture supports mapping blocks of several different sizes does not mean that the operating system will use the capability. Doing so introduces the problems associated with variable size partitions that we discussed earlier. However, even if this is not used for general-purpose memory management, it makes a lot of sense for mapping the operating system address space efficiently [1].
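
As a sketch of how the two entry types are distinguished, here is a decoder for the ARMv7-A short-descriptor first-level entry format, where the low two bits select the entry type and bit 18 marks a supersection; treat this as illustrative rather than a complete implementation:

    #include <stdint.h>

    enum fl_type { FL_FAULT, FL_PAGE_TABLE, FL_SECTION, FL_SUPERSECTION };

    enum fl_type first_level_type(uint32_t desc) {
        switch (desc & 0x3) {
        case 0x1: return FL_PAGE_TABLE;                         /* -> second-level table */
        case 0x2: return (desc & (1u << 18)) ? FL_SUPERSECTION  /* 16 MB mapping */
                                             : FL_SECTION;      /* 1 MB mapping  */
        default:  return FL_FAULT;                              /* unmapped or reserved */
        }
    }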

Translation Lookaside Buffers (TLB)

Figure 2. ARM memory translation

The ARM has two levels of TLBs. The smallest and fastest is the MicroTLB. There is a MicroTLB for the instruction and data sides of the CPU (instruction fetches use one MicroTLB while data read/write operations use the other). The MicroTLB can store 32 entries [2]. The cache is fully associative and can perform a lookup in one clock cycle. Hence, there is no performance penalty for any memory references that are satisfied by the MicroTLB. The architecture supports the use of an address space identifier (ASID) to allow the operating system to distinguish one process' address space from another's without flushing the cache. Entries can also be tagged as global, so that they are shared among all address spaces. This is, of course, useful for mapping the memory regions used by the operating system. Each entry contains a number of protection bits and these are checked at each address lookup. If the protections disallow the requested memory operation (e.g., no-execute or read-only) then the MMU will signal a Data Abort, which will cause a trap. On a cache miss, the replacement algorithm may be selected to be either round-robin (the default) or random replacement.

The second-level TLB is called the Main TLB. It catches any cache misses from the MicroTLBs. There is only one of these per processor, so it handles misses from both the data-side and instruction-side MicroTLBs. The cache comprises eight fully associative entries, which are fast and may also have their contents locked (i.e., they will not be replaced). It also contains 64 entries with lower associativity. These are still much faster than a main memory access but may require a variable number of clock cycles for lookup. The ability to lock eight entries down is crucial for real-time systems. If you need to guarantee rapid real-time response to an event, you may not be able to afford to waste the time to access an in-memory page table to do an address translation. Note that the operating system, in addition to locking the TLB entries, has to ensure that the corresponding page frames stay locked in memory as well.

Figure 2 illustrates the full set of elements that come into play for memory translation:

  1. The first step is a MicroTLB lookup. An instruction fetch accesses the Instruction MicroTLB and a data read/write operation accesses the Data MicroTLB. If this lookup yields a match, then the access permissions in the page table entry are validated. If the request does not have proper permissions then a trap is generated (Data Abort signal). Otherwise, the memory access takes place. If the lookup yields no match then we continue to step 2.
  2. If the MicroTLB lookup yielded a cache miss then we consult the Main TLB. The process is the same as with the MicroTLB. If there is a match for the requested page and the permissions pass then the memory access is performed. If the lookup yields no match then we continue to step 3.
  3. The final step is to consult the page table(s) in memory. The first-level table is checked first. The system supports two first-level tables. High-order bits of the virtual address determine which one to use. The base of the table is stored in one of two base registers (TTBR0 or TTBR1), depending on whether the topmost n bits of the virtual address are 0 (use TTBR0) or not (use TTBR1). The value for n is defined by the Translation Table Base Control Register (TTBCR). Why this extra complexity? This allows one to have a design where the operating system and memory-mapped I/O are located in the upper part of the address space and managed by the page table in TTBR1, and user processes are in the lower part of memory and managed by the page table in TTBR0. On a context switch, the operating system has to change TTBR0 to point to the first-level table for the new process. TTBR1 will still contain the memory map for the operating system and memory-mapped I/O. The page table defined in TTBR1 encompasses memory that is common to all processes.

    Looking up an address via in-memory page tables is called a translation table walk since it may involve going through a hierarchy of tables. With the ARM MMU, this is a one-step direct mapping via the first-level page table if the entry refers to a section, or else a two-step process if the entry refers to a page. With sections, the physical base address is stored in the page table entry of the first-level table. With pages, the page table entry contains the address of the second-level table.
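
A small C sketch of the TTBR0/TTBR1 choice described in step 3 might look like this; the register names are real, but holding them in C variables is purely illustrative:

    #include <stdint.h>

    uint32_t ttbr0, ttbr1, ttbcr_n;   /* ttbcr_n: the 'n' field of TTBCR */

    uint32_t pick_table_base(uint32_t vaddr) {
        if (ttbcr_n == 0)
            return ttbr0;             /* n == 0: TTBR1 is not used */
        uint32_t high_bits = vaddr >> (32 - ttbcr_n);
        return (high_bits == 0) ? ttbr0   /* low addresses: per-process table  */
                                : ttbr1;  /* high addresses: OS / I/O mappings */
    }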

The following virtual-to-physical address translations are possible:

Translation flow for a 1 MB section

A 1 MB page (a section is just ARM's term for a big page) means that we need a 20-bit offset (2^20 = 1M). The 12 high-order bits of the 32-bit virtual address serve as the index into the first-level page table and get translated into a 12-bit physical section base address.

<12-bit table index>:<20-bit section offset> → <12-bit section base address>:<20-bit section offset>
Translation flow for a 16 MB supersection

A 16 MB page means that we need a 24-bit offset (2^24 = 16M). The 8 high-order bits of the 32-bit virtual address serve as the index into the first-level page table and get translated into an 8-bit physical supersection base address that is combined with an 8-bit extended base address to produce a 40-bit real address.

<8-bit table index>:<24-bit supersection offset> → <8-bit extended base address>:<8-bit supersection base address>:<24-bit supersection offset>
Translation flow for a 4 KB small page

With pages, we use a two-level hierarchy. The top 12 bits of the 32-bit virtual address contain the index into the first-level table. The next 8 bits contain the index into the partial page table (256 entries per partial table). The partial page table entry contains a 20-bit physical page frame number. The lowest 12 bits of the virtual address are the offset into this page.

<12-bit first-level index>:<8-bit second-level index>:<12-bit page offset> → <20-bit page frame number>:<12-bit page offset>
Translation flow for a 64 KB large page

This is the same two-level hierarchy as with 4 KB pages but with a different partitioning of bits. The top 12 bits of the 32-bit virtual address still contain the index into the first-level table. The next 4 bits contain the index into the partial page table (16 entries per partial table). The partial page table entry contains a 16-bit physical page frame number. The lowest 16 bits of the virtual address are the offset into this page.

<12-bit first-level index>:<4-bit second-level index>:<16-bit page offset> → <16-bit page frame number>:<16-bit page offset>
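
As an illustration, a simulated walk for the 4 KB small-page case could look like the following C sketch, where the tables are modeled as in-memory arrays of frame numbers and all type and permission checks are omitted:

    #include <stdint.h>

    uint32_t small_page_translate(uint32_t *const *first_level, uint32_t vaddr) {
        uint32_t l1  = vaddr >> 20;           /* 12-bit first-level index */
        uint32_t l2  = (vaddr >> 12) & 0xff;  /* 8-bit second-level index */
        uint32_t off = vaddr & 0xfff;         /* 12-bit offset within the page */

        uint32_t *partial = first_level[l1];  /* step 1: fetch the partial table */
        uint32_t frame    = partial[l2];      /* step 2: 20-bit page frame number */
        return (frame << 12) | off;
    }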

Protection and memory behavior

Every memory access is checked against permissions stored within the page table entry of each memory block (either page or section). Moreover, the page table entry can also specify how a region of memory behaves with regard to caching and the visibility of modifications to other cores or processors. Memory regions can have the following attributes:

  • Execute never: prevents the instruction-fetch part of the processor from accessing this region of memory.
  • Read-only, read/write, no access: these modes can be set for user mode as well as privileged (kernel) mode. For example, kernel memory can be tagged as no access for user mode but read/write for kernel mode execution.
  • Non-secure: identifies memory as not being part of a "trusted" region.
  • Sharable: identifies whether a region of memory is shared with other processors or is mapped to devices. Several modes are available:

    • Strongly ordered: memory accesses must occur in program order
    • Device/shared or device/non-shared: the memory is mapped to a device (hence, not cached) and the device is or is not shared with other processors on the bus.
    • Normal/shared or normal/non-shared: regular memory that is either not shared (feel free to cache) or is shared among other processors on the bus.

If the permissions do not allow the access, the MMU traps with a Memory Abort signal to the processor.

Intel IA-32 and x86-64

The Intel architecture quickly became the dominant PC architecture after the introduction of the IBM PC in 1981, which used Intel's 8088 processor, a 16-bit processor with an external 8-bit data bus. A succession of both Intel and other compatible (e.g., AMD, VIA) processors followed, all supporting the same instruction set. The architecture advanced to 32 bits with the 80386 and then to a 64-bit architecture while still retaining backward compatibility with earlier instruction sets. The initial 8086/8088 architecture was strictly a segmented architecture with no memory access protection and no privileged mode of execution. The IA-32 (Intel Architecture, 32 bits) architecture refers to the instruction set architecture of Intel processors from the 80386 through to today's Atom processors, which support a dual segmentation/paging model. Intel's 64-bit Itanium processor, with its IA-64 architecture, broke this compatibility and failed to achieve commercial success. Meanwhile, AMD extended the IA-32 instruction set to 64 bits and created the x86-64 instruction set architecture, which Intel adopted after the failure of the Itanium. The x86-64 architecture offers several backward compatibility modes to execute IA-32 software. In this section, we will examine the highlights of the IA-32 and x86-64 MMU architectures.

IA-32 Memory Models

The IA-32 architecture supports three distinct memory models:

Flat memory model
This model features a 4 GB linear, contiguous address space from address 0x00000000 to 0xffffffff.
Segmented memory model
In this model, memory appears as a group of independent address spaces, called segments (e.g., code segment, data segment, stack segment, etc.). The logical address is a combination of the segment selector and the offset (address within the segment). While certain segments are associated with specific operations (e.g., code, stack), the processor supports 16,383 segments, each of which can address up to 2^32 bytes (4 GB). A full memory address, called a far pointer, is a 16-bit segment selector and a 32-bit offset within the segment.
Real mode
This is the legacy 8086 segmented memory model. Segments can be up to 64 KB in size (16-bit addresses within a segment) and the maximum address space is limited to 2^20 bytes (1 MB). This is the mode used by the PC BIOS.

Figure 3. IA-32 Combined Segmentation & Paging

The IA-32 architecture supports a combined segmentation and paging model (figure 3). There are two tables of segments, each containing base addresses of up to 8,191 segments. The Local Descriptor Table (LDT) contains segment information that is private to the process. The Global Descriptor Table (GDT) contains segment information that is shared among all processes (e.g., segments for the operating system). The segment selector, which comprises 16 bits of a "far pointer", is an index into these tables. To select a segment, the selector is loaded into a segment register. Also, every instruction that references memory has an implicit segment register, which can be overridden by adding a segment prefix before the machine instruction. The 32-bit offset of the "far pointer" is the byte offset into the segment. Hence, a 32-bit linear address is effectively:

linear_address = LDT[segment_selector] + offset
or
linear_address = GDT[segment_selector] + offset

The value of the segment selector automatically selects the GDT (shared) or LDT (private).

This combined model uses segmentation to generate a linear address. This linear address is then treated as a virtual address and is translated by the paging logic of the MMU through a two-level page table hierarchy.

The paging logic can be disabled if it is not needed. If segment descriptors contain a base address of 0, then every memory reference is effectively a linear address. Thus, you can effectively use paging, segmentation, or both.

IA-32 segmentation

Each entry in the GDT contains not just the base address of a segment but also a number of flags that define segment attributes:

  • S flag: Is this a code or data segment?
  • Accessed: Has the segment been accessed since the last time the OS cleared this bit?
  • Dirty: Has the segment been modified since the last time the OS cleared this bit?
  • Data/write-enable: Is the segment read-only or read/write?
  • Data/expansion direction: normally, increasing the segment limit (size) causes space to be added at the top of the segment. This flag changes the expansion direction downward, causing space to be added at the bottom instead. This was designed for stacks.
  • Code/execute-only or execute/read: If this is a code segment, can the memory be read as data or just executed?
  • Conforming: If the code is tagged as conforming then execution can continue even if the privilege level is elevated.

IA-32 paging

The paging logic uses a 32-bit logical address space to create either a 52-bit or a 36-bit physical address, addressing up to either 4 petabytes or 64 gigabytes of memory. 52-bit addressing is enabled via the Physical Address Extension (PAE) mechanism. However, a process can access only a 4 GB address space at any time. The 36-bit address space is enabled via the Page Size Extension (PSE-36). The paging architecture supports 4 KB or 4 MB pages.

Intel 64-bit mode

Segments are fully supported in IA-32 emulation mode but are generally disabled in 64-bit mode. Intel 64-bit mode supports three paging modes:

32-bit paging

Figure 4. IA-32 Paging with 4 KB pages

Figure 5. IA-32 Paging with 4 MB pages

This mode supports 32-bit virtual addresses and generates 32- to 40-bit physical addresses using either 4 KB or 4 MB pages. With 4 KB pages (figure 4), a two-level page table hierarchy is used. The CR3 register points to the top-level table, called the page directory. It contains 1024 page directory entries (PDEs) and is indexed by the top 10 bits of the linear address. Each page directory entry contains the base address of a partial page table that is indexed by the next 10 bits of the linear address. This table contains the page table entry (PTE) that contains the top 20 bits of the physical address. The bottom 12 bits are obtained from the linear (virtual) address and are the offset into the page frame.
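
A simplified simulation of this two-level walk might look like the following C sketch; here CR3 and the table entries are modeled as ordinary pointers and only the present bit is checked:

    #include <stdbool.h>
    #include <stdint.h>

    #define PG_PRESENT 0x1u   /* bit 0 of a PDE/PTE: entry is valid */

    bool walk_ia32(const uint32_t *page_directory, uint32_t linear, uint32_t *phys) {
        uint32_t pde = page_directory[linear >> 22];    /* top 10 bits: PDE index */
        if (!(pde & PG_PRESENT))
            return false;                               /* page fault */

        const uint32_t *pt = (const uint32_t *)(uintptr_t)(pde & 0xfffff000u);
        uint32_t pte = pt[(linear >> 12) & 0x3ffu];     /* next 10 bits: PTE index */
        if (!(pte & PG_PRESENT))
            return false;

        *phys = (pte & 0xfffff000u) | (linear & 0xfffu); /* 20-bit frame + 12-bit offset */
        return true;
    }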

With 4 MB pages (figure 5), a single-level page table is used. CR3 points to the page directory, which contains 1024 entries and is indexed by the top 10 bits of the linear address. Each entry contains the top 18 bits of a 40-bit physical address. The bottom 22 bits are obtained directly from the linear address and are the offset into a 4 MB page frame.


PAE

Figure 6. PAE Paging with 4 KB pages

The PAE mode emulates the Physical Address Extension mode of the IA-32 architecture. It uses 32-bit virtual addresses and generates up to 52-bit physical addresses using either 4 KB or 2 MB pages.

With 4 KB pages (figure 6), the MMU uses a three-level page table hierarchy to map a 32-bit virtual address to a 52-bit physical address. The topmost table is a 4-entry page directory pointer table (PDPT) that is indexed by the top two bits of the virtual address. An entry in the PDPT points to a partial page directory that is indexed by the next 9 bits (512 entries per partial page directory). The page directory points to a partial page table, which is indexed by the next 9 bits (also 512 entries per partial page table). A PTE contains the top 40 bits of the physical address. The bottom 12 bits are obtained directly from the linear address and are the offset into a 4 KB page frame.


IA-32e paging

Figure 7. IA-32e Paging with 4 KB pages

This mode uses 48-bit virtual addresses and generates up to 52-bit physical addresses using either 4 KB, 2 MB, or 1 GB pages. A 52-bit address space can address 4096 terabytes (4 petabytes) of physical memory, which is considerably more than any systems in the near future are likely to have [3].

The most complex case is IA-32e paging with 4 KB pages (figure 7). This employs a four-level page table hierarchy. The 36 high-order bits of the 48-bit virtual address are split into chunks of 9 bits each, and each chunk indexes a 512-entry table that contains a pointer to the next table. The final table contains a page table entry with a 40-bit frame address that is joined with the 12-bit page frame offset from the virtual address.
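
Since each level consumes 9 bits, the four-level walk can be expressed as a loop. The following C sketch simulates it with tables modeled as plain arrays (pointers at the upper levels, a frame number at the lowest), omitting all flag and permission checks:

    #include <stdint.h>

    uint64_t walk_ia32e(uint64_t *table, uint64_t vaddr) {
        for (int shift = 39; ; shift -= 9) {            /* shifts: 39, 30, 21, 12 */
            uint64_t index = (vaddr >> shift) & 0x1ff;  /* one 9-bit chunk */
            if (shift == 12)                            /* lowest level reached */
                return (table[index] << 12) | (vaddr & 0xfff);
            table = (uint64_t *)(uintptr_t)table[index]; /* descend one level */
        }
    }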


Figure 8. IA-32e Paging with 2 MB pages

By increasing the page size to 2 MB, we reduce the number of pages the system needs to manage by a factor of 512. This allows the MMU to use one less level of hierarchy in the page table structure (figure 8). With 2 MB pages, three levels of the hierarchy are used.


Figure 9. IA-32e Paging with 1 GB pages

By further increasing the page size to a whopping 1 GB, the page table is reduced to a two-level hierarchy with a 512-entry table containing base addresses of 512-entry partial page tables (figure 9).

Translation Lookaside Buffers (TLBs)

The structure of TLBs is similar among the different Intel processors but varies in the size of the cache. As an example, let's look at the Intel Core i7. Like the ARM MMU, it has two levels of TLB. The instruction and data sides of each core get their own first-level TLB, which holds 128 entries for instructions and 64 entries for data. This TLB is ultra-fast and fully associative, allowing a lookup to take place within a single cycle. The second-level TLB is still much faster than going to main memory; it is unified (shared by instruction and data lookups), containing 512 page table entries.

Case studies summary

Intel is unique among modern processors in offering a segmentation model along with paging in its memory management unit. When coupled with paging, using segmentation avoids the pitfalls of variable partition allocation and provides a tidy method of controlling access permissions. For example, a data segment can be given a no-execute attribute regardless of the number of pages that it takes, and the operating system does not have to deal with replicating those permissions for each page that is allocated for data. However, modern operating systems have generally ignored segmentation, largely because implementing it would have been a blow to cross-platform compatibility. It's also an extra level of bookkeeping that is not strictly necessary. With the x86-64 architecture, segmentation has been practically discarded.

The processors all support varying page sizes. Note that page sizes will generally not be intermixed (that would make memory management painful); rather, the operating system, upon startup, will configure the MMU to use a specific page size. The decision will largely be driven by the available memory on a system. For instance, while 4 KB is a highly popular (and generally standard) page size, it seems incredibly tiny on a system with 8 GB of RAM. Such a system will have to manage over two million page frames!

The ARM architecture tries to optimize the size of the page table by supporting two base registers. This allows one to set up a single table that maps the memory for the operating system and memory-mapped I/O. It also supports "sections" and "supersections" that can coexist with pages; these avoid the need for a second-level page table lookup and also optimize the use of the TLB by creating a single page table entry that points to a large region of memory.

Both architectures support a multi-level page table hierarchy. ARM is only a 32-bit architecture, but the 64-bit Intel architecture supports 48-bit addresses. Depending on the page granularity, the Intel architecture may require up to four levels of page tables (IA-32e paging with 4 KB pages). One of the perils of a large address space is that it can map to a very large number of page frames, necessitating a deeper hierarchy to keep the size of page tables from getting too big.

Both architectures also employ a two-level TLB. The first level of the TLB is smaller and replicated for the instruction and data fetch units of the microarchitecture. The second level is larger, a bit slower, and shared.

Finally, both architectures support page-based memory access permissions as well as indicators of whether a page has been modified (dirty) or referenced (accessed). As we'll see in the following discussion, both of these are useful when we need to deal with page replacement.

Demand paging

It's not uncommon for the virtual address space to be bigger than the physical address space. A process typically uses only a small fraction of the virtual address space, and the operating system needs to map only the parts that are used into physical memory (page frames). What if this isn't the case? What if a process needs to use more memory than is available on the system? Or, since we have a multitasking environment where we time slice execution among multiple processes, what if the memory requirements of all of our processes are greater than the physical memory on the system? In this case, we will have to save some pages onto the disk and bring them back later when we need them.

Demand paging is an approach to memory management where we load pages into memory only as they are needed by the process. For example, when a process starts executing and tries to load its first instruction, the operating system will get a page fault because the required page has not been loaded and mapped onto a page frame. It will examine the faulting address, realize that the process is trying to access a valid memory location that just happens not to be mapped to physical memory, and load the first page of memory from the program's executable file. As the program begins to run and tries to access some data (for example), it will generate another page fault. The operating system will again process that fault and map another page into a page frame. This process continues until all the pages that a process needs are in memory.

The valid/invalid bit in a page table entry (PTE) tells the memory management unit whether a specific page is mapped into physical memory ("valid") or is not ("invalid"). An attempt to access an invalid page will cause the MMU to generate a page fault. If a page is not mapped to memory, that means that it is either an invalid (unallocated, out-of-bounds) region of the process' address space or that it is a valid region that is not currently resident in main memory. The operating system has to determine which one of these cases holds.

The page fault handler performs the following actions (a structural sketch follows the list):

  1. Check the memory map in the process control block for that process to see if the memory access was indeed valid (versus out of bounds or dereferencing a null pointer).
  2. If the address was not valid, terminate the process.
  3. If the address was valid, we need to bring the page in:
    • A page that corresponds to a text or data region of memory will be read from the program file on the disk.
    • New stack and heap pages have no contents and do not require reading anything from the disk.
    • If the page maps to an existing stack or heap region and was saved onto the disk in the past because we didn't have enough pages, it will be read from a page file on the disk.
  4. Find a free frame in memory.
  5. Schedule a disk read to bring the page to the frame.
  6. Context switch, since we need to wait for the contents to be read. This, of course, is not needed for newly-allocated pages.
  7. When the disk read is complete, modify the page table for the process. The page table entry will now be flagged as valid and will contain the address of the page frame into which we read the page.
  8. Restart the instruction that generated the fault.
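
A structural sketch of these steps in C follows; every type and helper function (find_region, allocate_frame, and so on) is hypothetical, standing in for real kernel machinery:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical types and helpers standing in for real kernel machinery. */
    struct process;
    struct vm_region { bool backed_by_file; };

    struct vm_region *find_region(struct process *p, uintptr_t addr);
    void  terminate(struct process *p);
    void *allocate_frame(void);
    void  schedule_read(struct vm_region *r, uintptr_t addr, void *frame);
    void  context_switch_away(struct process *p);
    void  zero_fill(void *frame);
    void  map_page(struct process *p, uintptr_t addr, void *frame);

    void page_fault_handler(struct process *p, uintptr_t fault_addr) {
        struct vm_region *r = find_region(p, fault_addr); /* step 1: check memory map */
        if (r == NULL) {
            terminate(p);                                 /* step 2: invalid address */
            return;
        }
        void *frame = allocate_frame();                   /* step 4: find a free frame */
        if (r->backed_by_file) {
            schedule_read(r, fault_addr, frame);          /* steps 3 and 5: disk read */
            context_switch_away(p);                       /* step 6: wait for the I/O */
        } else {
            zero_fill(frame);                             /* new stack/heap page: no I/O */
        }
        map_page(p, fault_addr, frame);                   /* step 7: mark the PTE valid */
        /* step 8: returning from the exception restarts the faulting instruction */
    }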

Under Linux, a per-process structure called vm_area_struct defines regions of virtual memory. The structure defines the start of each virtual memory region (address space), the end of the region, and access rights. Several of these structures are created for each file image that is mapped to memory. When a page fault occurs, the kernel searches a tree structure of memory allocations for the process to see if the faulting memory address is valid for the process or not. If it is not an allocated address, then a SEGV (segmentation violation) signal is sent to the process. If it is a valid address but the type of access is invalid, the process also gets a signal. Otherwise, we have a reference to a valid page but one that is not currently in memory.

If we have a valid reference to a page that is not in memory, that page will either be in the executable file (for executable code and static data) or in a swap file (temporary space to hold swapped-out pages). If the PTE is not valid but a page number is present, then the page we want has been saved to a swap file. The page number field in the PTE tells us the location of that page within the file. If, on the other hand, the PTE is not valid and there is no page number present in the PTE for the page, then we know that we need to load the page contents from the program file.

Cost of paging from a disk

A disk is much, much slower than memory, so reading a page from a disk does not come without a performance burden. To get a very rough idea of just what this burden is, we can consider a few numbers. A memory access takes approximately 100 nanoseconds. Handling a page fault exception may take around 400 microseconds. That's a 4,000x penalty just for handling the trap associated with the page fault, not counting the disk operation to get the data. A disk seek and read may take around 10 milliseconds. That's around 100,000 times slower than a memory access!

Clearly, page faults are things we would like to avoid whenever possible. Given the 100,000-fold performance penalty, if we want to ensure that our application does not degrade in performance by over 10%, we need to ensure that we have no more than one page fault per million memory accesses.
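
The 10% figure comes from a simple effective-access-time calculation, sketched below in C using the rough numbers above:

    #include <stdio.h>

    /* Effective access time with a 100 ns memory access, a ~10 ms
       (10,000,000 ns) fault serviced from disk, and a fault rate p:
       EAT = (1 - p) * 100 + p * 10,000,000 nanoseconds. */
    int main(void) {
        const double mem_ns = 100.0, fault_ns = 1.0e7;
        double p   = 1.0e-6;            /* one fault per million accesses */
        double eat = (1.0 - p) * mem_ns + p * fault_ns;
        printf("effective access: %.1f ns (%.0f%% slowdown)\n",
               eat, (eat / mem_ns - 1.0) * 100.0);  /* ~110 ns, ~10% slowdown */
        return 0;
    }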

Page replacement

In the ideal case, we have many unused page frames in our operating system and the page fault handler will pick one of those, load a page into it, and map it in the page table of the requesting process. What happens if all page frames are being used by processes? In this case, we need to replace a page that's in memory with one that's on the disk. The sequence of operations is:

  1. Pick a page to evict.
  2. Save the contents of the page onto the paging file on the disk if the page has been modified (the dirty bit in the page table entry is set). If the page has not been modified (e.g., it's a static data or code page) then we can always re-read it from the program file.
  3. Change the page table for the process whose page we evicted to show that the corresponding page is no longer valid.
  4. Read the desired page into the frame we chose.
  5. Update the page table for the process that caused the page fault (the one that needed the page) so that the new page is valid and pointing to the page frame into which we read it.
  6. Return from exception, restarting the instruction that generated the fault.

What we don't want to do is evict a page that is likely to be used again, because then we'll have to go through the same sequence of operations just to get it back. We need a good page replacement policy to ensure that we can still maintain good performance. Let's examine a few replacement policies.

FIFO (First In, First Out) replacement

Common sense may tell us that this is a good algorithm: the oldest memory pages are likely to be a program's initialization code that is no longer needed. Unfortunately, some of these early ("first in") pages may contain frequently-used libraries or global variables. Getting rid of these pages is definitely something we don't want to do.

Least Recently Used (LRU)

What we really want to do is pick the least recently used page, under the assumption that it's the least likely to get used again. This way we can distinguish pages holding initialization code (which will be least recently used) from pages holding crucial libraries, which will be recently used even if they are older than other pages.

To implement LRU, we need to timestamp each page whenever it is accessed. When we need to remove a page, we search for the page frame that has the oldest timestamp.

This is a good algorithm and it's exactly what we want BUT ... it is not feasible to generate and log a timestamp for each access to a page. That would mean that every memory operation is accompanied by a memory write of several bytes of timestamp data. Memory management units don't do this. Moreover, even if they did, it would bog down the performance of the memory system. We need to find a simpler approach that approximates LRU.

Not Frequently Used replacement

Here's a first attempt at simulating an LRU algorithm. Each page table entry has a "referenced" bit. The operating system also keeps a counter for each page frame. At each periodic clock interrupt, the operating system goes through the page table entries and adds the referenced bit (which will be 0 or 1) to the page frame counter that corresponds to that page. It then clears the referenced bits. To pick a page to evict, the operating system chooses the frame with the lowest counter.
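
A minimal C sketch of this bookkeeping, with an illustrative frame count:

    #include <stdbool.h>
    #include <stdint.h>

    #define NFRAMES 1024                 /* illustrative frame count */

    bool     referenced[NFRAMES];        /* the per-page "referenced" bit */
    uint32_t counter[NFRAMES];           /* OS-maintained per-frame counter */

    /* Run at each periodic clock interrupt: accumulate reference bits. */
    void nfu_tick(void) {
        for (int i = 0; i < NFRAMES; i++) {
            counter[i] += referenced[i] ? 1 : 0;
            referenced[i] = false;       /* clear for the next interval */
        }
    }

    /* To evict, pick the frame with the lowest accumulated count. */
    int nfu_pick_victim(void) {
        int victim = 0;
        for (int i = 1; i < NFRAMES; i++)
            if (counter[i] < counter[victim])
                victim = i;
        return victim;
    }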

The problem with this algorithm is that it has no sense of time. It can log a lot of page references to a specific page, hence causing that page frame's counter to reach a high value, but we won't know if those references happened a long time ago or recently. Moreover, iterating through all the page frames to update counters is a time-consuming operation — not something we want to do on every clock interrupt.

Clock (Second Chance) replacement

In this algorithm, we arrange the physical pages in a logical circle (a circular queue). A pointer (the clock hand) points to the first frame. The paging hardware provides us with a "referenced" bit per frame to tell us if the page has been referenced; the bit is set whenever there's a memory reference. The algorithm uses this "referenced" bit to determine whether a page has been referenced in a recent window of time.

On a page fault, we advance our pointer (the clock hand) to point to the next page frame. Then we check the "referenced" bit for that page. If the bit is 1, then we deem that the page has been used recently and it's not a good target for replacement. We clear the "referenced" bit and advance the pointer to the next page frame. If, however, the bit is 0, we pick this page as the page to evict.

Note that if all the pages have been referenced since the last page fault, we end up going around the entire circular queue — the entire set of frames. As we go through the frames, we clear the "referenced" bit, so we'll end up right where we started and, this next time around, we'll see that the referenced bit is 0. Hence, if every page frame does get referenced, the algorithm degenerates to round-robin replacement.
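
A C sketch of the clock hand's sweep (the frame count and data structures are illustrative):

    #include <stdbool.h>

    #define NFRAMES 1024                 /* illustrative frame count */
    bool referenced[NFRAMES];            /* hardware-set "referenced" bit per frame */
    static int hand = 0;                 /* the clock hand */

    /* On a page fault: sweep forward, clearing reference bits on recently
       used frames, and evict the first frame whose bit is already 0. */
    int clock_pick_victim(void) {
        for (;;) {
            hand = (hand + 1) % NFRAMES;  /* advance the hand around the circle */
            if (referenced[hand])
                referenced[hand] = false; /* used recently: give a second chance */
            else
                return hand;              /* not referenced since last sweep: evict */
        }
    }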

Nth chance replacement

This approach is similar to the Second Chance algorithm. The variation is that we maintain a counter along with the "referenced" bit. On a page fault, we advance our pointer (the clock hand) to point to the next page frame. We then check the "referenced" bit for that page. If it is 1, we clear the bit and set the counter to 0. If it is 0, we increment the counter. If the counter is less than N, we go on. Otherwise, we evict this page.

What we are doing is counting the number of times that we examined this page and it has not been referenced. The value of N has to be tuned to a specific environment and serves as a threshold for determining whether a page should be evicted because it has not been referenced after a number of page faults. The Nth Chance replacement algorithm gives us a better approximation of LRU than Second Chance does.
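
The same sketch extended with the counter (N and the frame count are illustrative):

    #include <stdbool.h>

    #define NFRAMES 1024
    #define N 4                          /* eviction threshold; tuned per system */

    bool referenced[NFRAMES];
    int  sweeps_unreferenced[NFRAMES];   /* sweeps since the frame was referenced */
    static int hand = 0;

    int nth_chance_pick_victim(void) {
        for (;;) {
            hand = (hand + 1) % NFRAMES;
            if (referenced[hand]) {               /* referenced: reset its count */
                referenced[hand] = false;
                sweeps_unreferenced[hand] = 0;
            } else if (++sweeps_unreferenced[hand] >= N) {
                return hand;                      /* unreferenced N times: evict */
            }
        }
    }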

Working Sets & Thrashing

A timesharing operating system will have multiple processes loaded into memory at the same time and will be switching between executing all the ones that are in a ready state. Each process has its own address space that is managed by a page table that the operating system maintains. During a context switch, the operating system will change the page table base register to switch memory mappings to the address space of the new running process. If the TLB supports the concept of address space identifiers (ASIDs), the operating system will have to maintain a mapping between processes and ASIDs and change the ASID along with changing the page table base register. If the processor does not support ASIDs (e.g., Intel processors before the Core 2), then the TLB contents get flushed (invalidated) whenever the page table is changed.

As a process runs, it exhibits a certain degree of locality. Instructions usually appear in adjacent memory locations, most loops are over a small area, and data references tend to be clustered. This collection of memory regions, the set of pages that have been referenced in a recent time window, is known as the process' working set. The size of the working set, as well as the pages that make it up, varies during the execution of a program. For a process to perform well (i.e., not generate page faults), it should have its working set in memory.

As we saw when we examined process scheduling, having more processes in memory is good since it increases CPU utilization and, hence, overall throughput. The more processes are in memory, the higher the chance that one of them is available to run. What's bad, however, is when processes do not have enough page frames available to them. If a process does not have its working set in memory, it will very likely generate a page fault when it is next scheduled to run. When it's given a page and, later, another process runs, that process may be in the same predicament and also generate a page fault. This situation where processes do not have enough page frames available to them is called thrashing. Thrashing occurs when the sum of all working sets of all processes in the system is greater than the total memory in the system. Given the cost of disk access, it has a devastating effect on system performance.

While thrashing may be unavoidable when too many processes run and consume all available memory, it can be delayed through smart resident set management. The resident set of a process is the set of that process' pages that are currently resident on the system (mapped to page frames). What we want to do is to ensure that one process does not grab an unfair amount of memory relative to other processes. Ideally, a process would be allocated only a sufficient number of pages to hold its working set. Of course, we cannot predict what pages a process will access and don't know the process' working set. The working set model approximates the locality characteristics of a program. It starts off by defining a working set window, Δ. This is some fixed amount of elapsed time while the process was actually in the running state (e.g., n memory references). The working set size for a process P_i, WSS_i, is defined as the number of pages referenced in the most recent Δ page references. The total system-wide demand for frames is D = Σ WSS_i. This is just the sum of the working set sizes for all processes. If D is greater than the total memory size then we can expect thrashing.
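
The thrashing condition is then a one-line test, sketched here in C with illustrative names:

    #include <stdbool.h>
    #include <stddef.h>

    /* D = sum of WSS_i; if D exceeds the number of available page frames,
       expect thrashing. */
    bool expect_thrashing(const size_t *wss, int nprocs, size_t total_frames) {
        size_t demand = 0;
        for (int i = 0; i < nprocs; i++)
            demand += wss[i];            /* accumulate each working set size */
        return demand > total_frames;
    }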

One way to ensure that memory is distributed fairly among processes is by monitoring page fault frequency. If a process does not have its working set in memory, it will generate page faults. This is an indication that it needs more memory. However, if a process never generates page faults, that could be an indication that it has too much memory allocated to it — more than its working set. If we don't have memory to spare, we expect processes to generate a certain, small amount of page faults as the process' working set changes over time. By setting low and high per-process thresholds of page faults, we can take pages away from processes that generate too few page faults and give them to processes that generate too many.
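
A sketch of such page-fault-frequency control in C, with invented thresholds and helper names:

    struct proc { int faults_this_interval; };

    void grow_resident_set(struct proc *p);    /* hypothetical helpers */
    void shrink_resident_set(struct proc *p);

    #define PFF_LOW   1   /* too few faults: more frames than it needs */
    #define PFF_HIGH 20   /* too many faults: working set doesn't fit */

    void rebalance(struct proc *procs, int n) {
        for (int i = 0; i < n; i++) {
            if (procs[i].faults_this_interval > PFF_HIGH)
                grow_resident_set(&procs[i]);     /* give it more frames */
            else if (procs[i].faults_this_interval < PFF_LOW)
                shrink_resident_set(&procs[i]);   /* reclaim frames from it */
            procs[i].faults_this_interval = 0;    /* start a new interval */
        }
    }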


Footnotes

[1] Every process needs to have the kernel memory, including memory-mapped I/O, mapped into its address space. A system call trap or a hardware interrupt changes the mode of execution from user-mode to kernel-mode but the memory map does not change. The kernel continues executing with the current page table until it explicitly changes it, which it does on a context switch.

[2] Older ARMv6 architectures supported 10 entries in the MicroTLB.

[3] Don Knuth writes, "It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache." [link]

References

  • Mel Gorman, Understanding the Linux Virtual Memory Manager, Prentice Hall, 2004.
  • John R. Levine, Linkers and Loaders, Morgan Kaufmann, October 1999, ISBN 1-55860-496-0.
  • Intel Software Developer's Manual (download site)
  • ARM1176JZ-S Technical Reference Manual, Revision r0p7, Chapter 6: Memory Management Unit (PDF version)
  • Physical Address Extension, Wikipedia article
  • x86-64, Wikipedia article
  • Four-level page tables merged, LWN.net article
    Also see: 4level page tables merged into mainline, LWN discussion
  • Anatomy of a program in memory, Gustavo Duarte, Jan 2009.
