How The Kernel Manages Your Memory

Source: Internet | Published by: 软件传销 | Editor: 程序博客网 | Time: 2024/05/21 06:34

Reposted from: http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:

Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program's memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two workhorses for managing program memory: the set of virtual memory areas and the page tables. Gonzo's memory areas are shown below:

Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual on x86 machines. VMAs do not care which segment they are in.

A program's VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read the file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.
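To make the lookup concrete, here is a minimal user-space sketch in Python. It parses text in the /proc/pid/maps format into VMA records and finds the area covering an address. The names Vma, parse_maps and find_covering are made up for this illustration, the sample lines are invented, and a sorted list with binary search stands in for the kernel's red-black tree.

```python
import bisect
from typing import List, NamedTuple, Optional


class Vma(NamedTuple):
    start: int  # first virtual address in the area
    end: int    # one past the last address in the area
    perms: str  # permission string, e.g. "r-xp"
    path: str   # backing file or pseudo-name such as "[heap]", "" if none


def parse_maps(text: str) -> List[Vma]:
    """Parse /proc/<pid>/maps style text into VMAs sorted by start address."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(Vma(start, end, fields[1], path))
    return sorted(vmas)


def find_covering(vmas: List[Vma], addr: int) -> Optional[Vma]:
    """Return the VMA containing addr, or None if no area covers it."""
    starts = [v.start for v in vmas]
    i = bisect.bisect_right(starts, addr) - 1  # last VMA starting at or below addr
    if i >= 0 and addr < vmas[i].end:
        return vmas[i]
    return None


# Invented sample in the maps format, loosely modeled on a 32-bit process.
SAMPLE = """\
08048000-08049000 r-xp 00000000 08:01 1234 /bin/gonzo
08049000-0804a000 rw-p 00000000 08:01 1234 /bin/gonzo
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
bffeb000-c0000000 rw-p 00000000 00:00 0 [stack]
"""
```

On a Linux machine you could feed it real data with parse_maps(open("/proc/self/maps").read()).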

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It's the little differences.

The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here's 3GB of user space in 4KB pages:

4KB Pages Virtual User Space
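The page arithmetic is simple enough to check by hand, but a few lines of Python make it explicit. This is only a sketch: PAGE_SHIFT and PAGE_SIZE echo the kernel's macro names, while the helper functions are named for this example.

```python
PAGE_SHIFT = 12               # 4KB pages: 2**12 bytes each
PAGE_SIZE = 1 << PAGE_SHIFT   # 4096

USER_SPACE = 3 * 1024**3      # the 3GB user portion of the address space


def page_number(addr: int) -> int:
    """Index of the virtual page that contains addr."""
    return addr >> PAGE_SHIFT


def page_offset(addr: int) -> int:
    """Byte offset of addr within its page."""
    return addr & (PAGE_SIZE - 1)


def page_align_up(size: int) -> int:
    """Round size up to a whole number of pages, as a VMA's size must be."""
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


# Number of 4KB pages needed to cover the 3GB of user space.
USER_PAGES = USER_SPACE // PAGE_SIZE
```

With these definitions, 3GB of user space works out to 786432 pages, and a request for a single byte still rounds up to one full 4096-byte page.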

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process' page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.

Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them; they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4GB. The other PTE fields are for another day, as is Physical Address Extension.
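As a rough illustration of the fields just described, the Python sketch below decodes a 4-byte PTE for a 4KB page and applies it to a virtual address. The bit positions follow the classic 32-bit x86 layout (P is bit 0, R/W bit 1, U/S bit 2, A bit 5, D bit 6, frame address in bits 12-31). The constant and function names are chosen for this example and the sample value is invented.

```python
PTE_PRESENT = 1 << 0    # P: page is present in physical memory
PTE_RW = 1 << 1         # R/W: clear means read-only
PTE_USER = 1 << 2       # U/S: clear means kernel-only
PTE_ACCESSED = 1 << 5   # A: page has been read or written
PTE_DIRTY = 1 << 6      # D: page has been written
FRAME_MASK = 0xFFFFF000  # top 20 bits: 4KB-aligned physical address
OFFSET_MASK = 0x00000FFF  # low 12 bits of a virtual address


def decode_pte(pte: int) -> dict:
    """Break a 32-bit PTE into the flags discussed in the text."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_RW),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
        "frame_addr": pte & FRAME_MASK,
    }


def translate(pte: int, vaddr: int) -> int:
    """Physical address for vaddr under this PTE, or a simulated page fault."""
    if not pte & PTE_PRESENT:
        raise LookupError("page fault: PTE not present")
    return (pte & FRAME_MASK) | (vaddr & OFFSET_MASK)


EXAMPLE_PTE = 0x12345067  # invented: frame 0x12345000, flags P, R/W, U/S, A, D
```

Because the frame address field is only 20 bits wide and frames are 4KB, the largest physical address it can name is just under 4GB, which is the limit mentioned above.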

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it's still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.
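You can observe two mappings of the same memory with different protections from user space. The Python sketch below maps one page of a temporary file twice, once writable and once read-only, using the standard mmap module. The function name is made up for this example. One caveat: Python itself rejects the write to the read-only map with a TypeError before the hardware ever sees it, so this shows the idea rather than an actual protection fault.

```python
import mmap
import tempfile


def demo_two_mappings():
    """Map the same file page twice with different access rights.

    Returns (bytes seen through the read-only map, whether writing
    through the read-only map was refused).
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)  # the file must be at least one page long
        f.flush()
        rw = mmap.mmap(f.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        ro = mmap.mmap(f.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            rw[0:5] = b"gonzo"       # write through the writable mapping
            seen = bytes(ro[0:5])    # read it back through the other mapping
            try:
                ro[0:5] = b"nope!"   # attempt a write through the read-only one
                write_blocked = False
            except TypeError:
                write_blocked = True
            return seen, write_blocked
        finally:
            ro.close()
            rw.close()
```

Both mappings are shared views of the same file page, so the bytes written through one are visible through the other even though only one of them permits writes.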

Virtual memory doesn't store anything; it simply maps a program's address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn't know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:

Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it's available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but we'll leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.
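For readers who haven't met buddy allocation before, here is a toy Python version of the technique. It is not the kernel's implementation: the class and method names are invented, and it tracks page frame numbers in plain sets. It does show the two core moves, splitting a large free block in half until it fits and merging a freed block with its "buddy" whenever the buddy is free too.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order page frames.

    A block of order k is 2**k contiguous frames starting at a frame
    number that is a multiple of 2**k.
    """

    def __init__(self, max_order: int):
        self.max_order = max_order
        # One set of free block start frames per order.
        self.free = {order: set() for order in range(max_order + 1)}
        self.free[max_order].add(0)  # initially one big free block

    def alloc(self, order: int) -> int:
        """Allocate 2**order frames and return the first frame number."""
        for k in range(order, self.max_order + 1):
            if self.free[k]:
                pfn = min(self.free[k])
                self.free[k].remove(pfn)
                # Split down to the requested size; each upper half
                # goes back on the free list of its order.
                while k > order:
                    k -= 1
                    self.free[k].add(pfn + (1 << k))
                return pfn
        raise MemoryError("no free block large enough")

    def free_block(self, pfn: int, order: int) -> None:
        """Free a block and merge it with its buddy as long as possible."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)  # the buddy differs in exactly one bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free[order].add(pfn)
```

With max_order=3 the allocator manages 8 frames. Allocating and then freeing every block leaves a single free block of order 3 again, which is the coalescing behavior that keeps large contiguous runs available.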

Let's put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:

Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.

A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says “sure”, and it creates or updates the appropriate VMA. But it does not actually honor the request right away; it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program's memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let's take the simple case of memory allocation:

Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there's no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.

When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.

Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.
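The whole sequence can be played out in a toy Python model. This is a sketch of the logic described above, not kernel code: ToyProcess and its methods are invented names, the "page table" is a dictionary, a missing entry stands in for a blank PTE, and only a single heap VMA exists. Permission checks are left out to keep it short.

```python
PAGE_SIZE = 4096


class SegmentationFault(Exception):
    """Raised when no VMA covers the faulting address."""


class ToyProcess:
    def __init__(self, heap_start: int):
        self.heap_start = heap_start
        self.heap_end = heap_start   # heap VMA covers [heap_start, heap_end)
        self.page_table = {}         # page number -> ("present", frame) or ("swap", slot)
        self.next_frame = 0          # frames handed out so far
        self.anonymous_faults = 0    # faults on never-mapped pages
        self.major_faults = 0        # faults that had to go to swap

    def brk(self, new_end: int) -> None:
        """Grow the heap: only the VMA changes, no frame is allocated."""
        self.heap_end = new_end

    def swap_out(self, vpn: int, slot: int) -> None:
        """Replace a present PTE with a non-blank, not-present swap entry."""
        self.page_table[vpn] = ("swap", slot)

    def access(self, addr: int) -> int:
        """Touch addr and return the frame backing it, faulting if needed."""
        vpn = addr // PAGE_SIZE
        pte = self.page_table.get(vpn)
        if pte is None or pte[0] != "present":
            self._handle_fault(addr)
        return self.page_table[vpn][1]

    def _handle_fault(self, addr: int) -> None:
        # Step 1: is there a VMA (a contract) covering this address?
        if not (self.heap_start <= addr < self.heap_end):
            raise SegmentationFault(hex(addr))
        # Step 2: look at the PTE to decide what kind of fault this is.
        vpn = addr // PAGE_SIZE
        if self.page_table.get(vpn) is None:
            self.anonymous_faults += 1   # blank PTE: never mapped
        else:
            self.major_faults += 1       # swap entry: contents come from disk
        # Step 3: allocate a frame and install a present PTE.
        self.page_table[vpn] = ("present", self.next_frame)
        self.next_frame += 1
```

A short session shows the laziness: after brk() the page table is still empty, the first touch of a page takes one fault, and later touches of the same page take none until the page is swapped out.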

This concludes the first half of our tour through the kernel's user memory management. In the next post, we'll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.
