Notes on Rereading <Understanding The Linux Virtual Memory Manager>

http://blog.csdn.net/melody_lu123/article/details/6996841

(Reposted from my own Google Doc)

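Chapter 2: Describing Physical Memory

    Distinguishing NUMA from UMA
    This comes down to how memory is partitioned and what it costs each CPU to access a given partition.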

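    The key kernel structures that describe blocks of memory:

node: the kernel's name for a bank of memory.
NUMA and UMA both use the same data structure, pglist_data, and all nodes are kept on a singly linked list. On UMA there is only one pglist_data instance, contig_page_data.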

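Each node is divided into zones, each representing a range of memory addresses and described by struct zone. Depending on the kernel configuration, every zone is one of ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM or ZONE_MOVABLE; see the comments in the definition of enum zone_type for what each type means.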

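Every physical page frame is described by a struct page. In kernels that use the flat memory model, all struct pages live in one global mem_map array set up by architecture code (with DISCONTIGMEM each node has its own node_mem_map instead), so the kernel can translate quickly between a page and the memory it describes.
The relationship between these structures is illustrated in the book.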

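    About typedef struct pglist_data

Some of the book's descriptions of this structure's fields are out of date. For example, node_start_paddr has been replaced by node_start_pfn (the author explains why in the book).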

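About PFNs
Page Frame Number (PFN): an index into physical memory counted in page-sized units. The PFN of a physical address is simply page_phys_addr >> PAGE_SHIFT.
The minimum PFN is the first page frame after the end of the kernel image.
The maximum PFN is architecture specific. On x86 it is derived from the e820 table supplied by the BIOS. On ARM the bootloader usually talks to the hardware and then passes the configured value to the kernel's init code; see http://www.simtec.co.uk/products/SWLINUX/files/booting_article.html and my blog post analysing the PandaBoard boot process.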
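
The shift above can be tried out with a few lines of Python. This is only an illustration of the arithmetic, assuming 4 KiB pages (PAGE_SHIFT = 12); the function names are made up for the example and are not kernel APIs.

```python
PAGE_SHIFT = 12              # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def phys_to_pfn(phys_addr: int) -> int:
    """Index of the page frame that contains phys_addr."""
    return phys_addr >> PAGE_SHIFT


def pfn_to_phys(pfn: int) -> int:
    """Physical address of the first byte of page frame pfn."""
    return pfn << PAGE_SHIFT
```

Every address inside one page shifts down to the same PFN, and shifting the PFN back up gives the page-aligned base address.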

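wait_table: a key member of struct zone, used for concurrency control on page operations so that only one user operates on a given page at a time. It is a hash table of wait queues, which avoids giving every page its own wait queue; that saves memory and improves efficiency.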
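
A toy model of the hashed wait table idea, assuming a simple multiplicative hash and a fixed number of buckets. The class, its methods and the hash constant are illustrative choices, not the kernel's implementation.

```python
class HashedWaitTable:
    """A few shared wait queues instead of one queue per page."""

    def __init__(self, bits: int = 4):
        self.bits = bits
        self.queues = [[] for _ in range(1 << bits)]

    def bucket(self, page_index: int) -> int:
        # Multiplicative hash: keep the top `bits` bits of a 32-bit product.
        # The odd constant is an arbitrary choice for this sketch.
        product = (page_index * 0x9E370001) & 0xFFFFFFFF
        return product >> (32 - self.bits)

    def wait_on(self, page_index: int, waiter: str) -> None:
        self.queues[self.bucket(page_index)].append((page_index, waiter))

    def wake(self, page_index: int) -> list:
        """Wake only the waiters for this page; colliding pages keep waiting."""
        queue = self.queues[self.bucket(page_index)]
        woken = [w for p, w in queue if p == page_index]
        queue[:] = [(p, w) for p, w in queue if p != page_index]
        return woken
```

Different pages may hash to the same queue, so each waiter records which page it is waiting for and a wake-up filters on that.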

The core function for zone initialisation is free_area_init_core().

for_each_online_pgdat iterates over all node structures.

    Discussion of FLAT memory versus SPARSE memory
    http://lxr.linux.no/#linux+v3.0.4/Documentation/memory-hotplug.txt

    http://forums.gentoo.org/viewtopic-t-872703-view-previous.html

Sparse memory is only of use on NUMA systems or systems supporting hot pluggable memory.

A NUMA system supports several CPUs (not CPU cores) each with their own local memory and a way for each CPU to address the global memory pool. Accessing the non-local memory is usually slower.

When memory is hot plugged, the kernel determines where in the physical address space it goes.

Both features make a big hole in your wallet - you would know if you had either of them.

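    About struct zone
    For the meaning of each key member, see the definition of struct zone.
    It holds the watermarks, which are how the zone interacts with kswapd.
    Verifying the kernel actions the book describes for the three watermark states: TBD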

pages_low When pages_low number of free pages is reached, kswapd is woken up by the buddy allocator to start freeing pages. This is equivalent to when lotsfree is reached in Solaris and freemin in FreeBSD. The value is twice the value of pages_min by default;

pages_min When pages_min is reached, the allocator will do the kswapd work in a synchronous fashion, sometimes referred to as the direct-reclaim path. There is no real equivalent in Solaris but the closest is the desfree or minfree which determine how often the pageout scanner is woken up;

pages_high Once kswapd has been woken to start freeing pages it will not consider the zone to be “balanced” until pages_high pages are free. Once the watermark has been reached, kswapd will go back to sleep. In Solaris, this is called lotsfree and in BSD, it is called free_target. The default for pages_high is three times the value of pages_min.
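
The three watermarks quoted above can be summarised in a small sketch. It assumes the default ratios from the quoted text (low = 2 x min, high = 3 x min) and simplifies the exact comparisons; the function names and return strings are made up for illustration.

```python
def watermarks(pages_min: int) -> dict:
    """Default watermark values derived from pages_min, per the quoted text."""
    return {"min": pages_min, "low": 2 * pages_min, "high": 3 * pages_min}


def allocator_action(free_pages: int, pages_min: int) -> str:
    """What the allocator does when it sees this many free pages."""
    wm = watermarks(pages_min)
    if free_pages <= wm["min"]:
        return "direct reclaim"      # allocator does kswapd's work synchronously
    if free_pages <= wm["low"]:
        return "wake kswapd"         # background reclaim starts
    return "none"


def kswapd_may_sleep(free_pages: int, pages_min: int) -> bool:
    """kswapd keeps reclaiming until the zone is back at pages_high."""
    return free_pages >= watermarks(pages_min)["high"]
```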

http://linux.sys-con.com/node/431838?page=0,0

http://linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=1

    About struct page
    See the description of page frames on page 148 of <Professional Linux Kernel Architecture>.
    Sections that can be skipped
    Section 2.4; read <Professional Linux Kernel Architecture> instead.

Chapter 3: Page Table Management

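    Caches and the TLB
    x86 manages both of these automatically (through a combination of software and hardware); architectures without this capability need to hook the relevant operations.
    Reference: http://en.wikipedia.org/wiki/Translation_lookaside_buffer
    Architectures differ in how they flush the TLB; x86 and ARM offer a richer set of options here.
    Each architecture has its own implementation of the TLB operations.
    For ARM, for example, see arch/arm/include/asm/tlbflush.h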

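    The end of the chapter lists a set of cache and TLB functions; naturally, each architecture implements them differently.
    Page Table Entry Protection and Status Bits
    These vary by architecture; each architecture interprets the corresponding bits in its own way.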

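    For efficiency, page tables are cached on various lists. The trade-offs differ between architectures, so the caching details vary slightly, but the general idea is the same.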

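    Section 3.6 describes how the page tables are set up, an important part of the boot process.
    startup_32() initialises basic memory in preparation for initialising paging.
    paging_init() is what actually initialises the page tables.
    The implementations differ by architecture and have to be looked at separately; on ARM, for example, this is paging_init() in arch/arm/mm/mmu.c.

Chapter 4: Process Address Space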

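    mm_struct describes a process's address space. Each process has exactly one mm_struct, shared by all of its threads.
    Kernel threads do not need an mm_struct; the mm field of their task_struct is NULL.
    active_mm in task_struct lets a task borrow the previous task's mm_struct, which avoids unnecessary TLB flushes and improves performance considerably. This is lazy TLB, typically used by kernel threads, since they do not care about user space.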
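
A toy model of the active_mm borrowing described above. It counts every switch to a different address space as one TLB flush, which is a simplification (real hardware and kernels have more cases); Task and count_tlb_flushes are hypothetical names for this sketch only.

```python
class Task:
    def __init__(self, name, mm=None):
        self.name = name
        self.mm = mm              # None for a kernel thread
        self.active_mm = mm


def count_tlb_flushes(schedule):
    """Run the tasks in order on one CPU and count address-space switches."""
    loaded_mm = None
    flushes = 0
    prev = None
    for task in schedule:
        if task.mm is None:
            # Kernel thread: borrow whatever address space is already loaded.
            task.active_mm = prev.active_mm if prev else None
        else:
            task.active_mm = task.mm
            if task.mm is not loaded_mm:
                loaded_mm = task.mm
                flushes += 1
        prev = task
    return flushes
```

Scheduling a user task, then a kernel thread, then the same user task again costs a single switch in this model, because the kernel thread never replaced the loaded address space.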

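    vm_area_struct describes each region (VMA) used by a process.
    Two data structures help manage (mainly look up) VMAs. The red-black tree speeds up lookups on page fault, where the most common query is which VMA a given address belongs to.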
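
The lookup can be sketched with a sorted list and binary search standing in for the red-black tree (both give logarithmic lookups). The semantics follow find_vma(): return the first VMA whose end lies above the address, which may or may not actually contain it. The AddressSpace class and its tuple layout are assumptions for this example, and VMAs are assumed not to overlap.

```python
import bisect


class AddressSpace:
    def __init__(self):
        self._ends = []    # VMA end addresses, kept sorted
        self._vmas = []    # (start, end, name), same order as _ends

    def insert(self, start: int, end: int, name: str) -> None:
        i = bisect.bisect_left(self._ends, end)
        self._ends.insert(i, end)
        self._vmas.insert(i, (start, end, name))

    def find_vma(self, addr: int):
        """First VMA with addr < end, or None if addr is above every VMA."""
        i = bisect.bisect_right(self._ends, addr)
        return self._vmas[i] if i < len(self._vmas) else None

    def vma_containing(self, addr: int):
        """The VMA that really covers addr, as a page fault handler needs."""
        vma = self.find_vma(addr)
        return vma if vma is not None and vma[0] <= addr else None
```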

    The fault callback in vm_operations_struct handles page faults.
    address_space ties the mapping to a file or device; its address_space_operations handle operations such as writing back to disk.
    For find_vma(), see section 4.5.1 of <Professional Linux Kernel Architecture>.
    How to recognise a COW page
    A COW page is recognised because the VMA for the region is marked writable even though the individual PTE is not.
    This needs to be checked against the implementation in the latest kernel.

Non-Linear VMA Population

In 2.4, a VMA backed by a file is populated in a linear fashion. This can be optionally changed in 2.6 with the introduction of the MAP_POPULATE flag to mmap() and the new system call remap_file_pages(), implemented by sys_remap_file_pages(). This system call allows arbitrary pages in an existing VMA to be remapped to an arbitrary location on the backing file by manipulating the page tables.

On page-out, the non-linear address for the file is encoded within the PTE so that it can be installed again correctly on page fault. How it is encoded is architecture specific so two macros are defined called pgoff_to_pte() and pte_to_pgoff() for the task.

This feature is largely of benefit to applications with a large number of mappings such as database servers and virtualising applications such as emulators. It was introduced for a number of reasons. First, VMAs are per-process and can have considerable space requirements, especially for applications with a large number of mappings. Second, the search get_unmapped_area() uses for finding a free area in the virtual address space is a linear search which is very expensive for large numbers of mappings. Third, non-linear mappings will prefault most of the pages into memory whereas normal mappings may cause a major fault for each page, although this can be avoided by using the new MAP_POPULATE flag with mmap() or by using mlock(). The last reason is to avoid sparse mappings which, at worst case, would require one VMA for every file page mapped.

However, this feature is not without some serious drawbacks. The first is that the system calls truncate() and mincore() are broken with respect to non-linear mappings. Both system calls depend on vm_area_struct→vm_pgoff which is meaningless for non-linear mappings. If a file mapped by a non-linear mapping is truncated, the pages that exist within the VMA will still remain. It has been proposed that the proper solution is to leave the pages in memory but make them anonymous but at the time of writing, no solution has been implemented.

The second major drawback is TLB invalidations. Each remapped page will require that the MMU be told the remapping took place with flush_icache_page() but the more important penalty is with the call to flush_tlb_page(). Some processors are able to invalidate just the TLB entries related to the page but other processors implement this by flushing the entire TLB. If re-mappings are frequent, the performance will degrade due to increased TLB misses and the overhead of constantly entering kernel space. In some ways, these penalties are the worst as the impact is heavily processor dependant.

It is currently unclear what the future of this feature, if it remains, will be. At the time of writing, there is still on-going arguments on how the issues with the feature will be fixed but it is likely that non-linear mappings are going to be treated very differently to normal mappings with respect to pageout, truncation and the reverse mapping of pages. As the main user of this feature is likely to be databases, this special treatment is not likely to be a problem.

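Chapter 5: Boot Memory Allocator

    A one-shot allocator that prepares memory for the rest of the kernel. It does the necessary groundwork before the MMU setup and other key kernel machinery are up and running.
    UMA and NUMA each have their own set of boot memory allocator interfaces.
    Each architecture provides its own boot memory allocator setup (the entry point is always in setup_arch()):
    x86: setup_memory
    ARM: bootmem_init
    MIPS, SPARC: bootmem_init()
    PPC: do_init_bootmem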

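    Once an architecture has obtained the physical memory layout in its own way and done some other simple setup, it starts calling its boot memory allocator to hand out memory to the components initialised afterwards.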

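Chapter 6: Physical Page Allocation

    Physical pages are managed by the binary buddy allocator.
    GFP flags tell the binary buddy allocator how to find suitable pages.
    The GFP flags fall into three groups: zone modifiers that tell the allocator which zone to try, action modifiers that control the allocator's behaviour, and a set of commonly used compound flags provided for developers. See pages 217-218 of <Professional Linux Kernel Architecture> for details.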
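
A minimal sketch of the binary buddy idea itself (splitting on allocation, coalescing on free). It ignores zones and GFP flags entirely, tracks free blocks in Python sets rather than linked free lists, and uses made-up names; it is a teaching model, not the kernel's allocator.

```python
class BuddyAllocator:
    def __init__(self, max_order: int):
        self.max_order = max_order
        # free[order] holds the first page number of each free 2**order block
        self.free = [set() for _ in range(max_order + 1)]
        self.free[max_order].add(0)

    def alloc(self, order: int):
        """Return the first page of a free 2**order block, or None."""
        for cur in range(order, self.max_order + 1):
            if self.free[cur]:
                pfn = min(self.free[cur])
                self.free[cur].remove(pfn)
                # Split down to the requested order, freeing each upper half.
                while cur > order:
                    cur -= 1
                    self.free[cur].add(pfn + (1 << cur))
                return pfn
        return None

    def free_block(self, pfn: int, order: int) -> None:
        """Free a block and merge it with its buddy for as long as possible."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)   # buddy differs only in bit `order`
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free[order].add(pfn)
```

The XOR is the core trick: a block's buddy at a given order is found by flipping one bit of its page number, so no search is needed when coalescing.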

    Of the three PF_xxx flags mentioned in section 6.4.1, only the first still exists in the 3.0 kernel.
    http://lxr.linux.no/#linux+v3.1.5/include/linux/sched.h

Chapter 7: Non-Contiguous Memory Allocation

    VMALLOC_RESERVE
    128 MB on both x86 and ARM.
    vm_struct describes the vmalloc address space.

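Chapter 8: Slab Allocator

    Key data structures: kmem_cache and slab

It avoids the memory fragmentation caused by small allocations.

It speeds up allocation of commonly used data structures.

It makes better use of the L1 and L2 caches: the colouring scheme exists to exploit cache lines fully, so that objects in different slabs start at different offsets and land on different cache lines.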
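
The colouring scheme can be sketched as follows: the space left over in a slab after packing objects is divided into cache-line-sized steps, and each new slab starts its objects one step further in, wrapping around when the slack is used up. The function name, the 32-byte line size and the parameters are assumptions for illustration.

```python
def slab_colour_offsets(left_over: int, cache_line: int = 32, n_slabs: int = 6):
    """Starting offset of the first object in each of n_slabs new slabs."""
    colours = left_over // cache_line   # distinct offsets that fit in the slack
    offsets = []
    colour_next = 0
    for _ in range(n_slabs):
        offsets.append(colour_next * cache_line)
        colour_next += 1
        if colour_next >= colours:
            colour_next = 0             # wrap around to offset 0
    return offsets
```

With 100 bytes of slack and 32-byte lines there are three colours, so consecutive slabs start at offsets 0, 32, 64 and then repeat. With less than one cache line of slack every slab starts at offset 0.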

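    The SLAB_xxx flags that in 2.6 served the same purpose as GFP_xxx have been removed.

That is, all the flags listed in Table 8.5: Cache Allocation Flags have been replaced by the corresponding GFP_xxx flags.

How to tell which kind a slab is (free, full or partial): through struct page.

Depending on the size of the objects in a slab, the slab descriptor is stored in one of two places:
together with the objects (on-slab)

separately from the objects (off-slab)

    kmem_bufctl_t is used to chain together the free objects within a slab.

    kmem_list3 and cache_cache describe the slab allocator itself.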

Chapter 9: High Memory Management

    Each architecture has a similar but distinct implementation of high memory management; see mm/highmem.c under the relevant arch directory.

    PKMap (Permanent Kernel Map)

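It shows up in the dmesg output. It means the same thing on x86 and ARM; only the start address differs.
For ARM, see Documentation/arm/memory.txt
On x86, it runs from PKMAP_BASE to FIXADDR_START

    The key mapping function is kmap_high().

A series of checks is performed before it is called. The 2.6 implementation differs from 2.4, so refer to the description of 2.6 at the end of the chapter.

kmap_high() also calls map_new_virtual(), which uses a common trick: resume scanning where the previous scan left off (an idea used in many places in the kernel). At the right moment (when last_pkmap_nr wraps to 0, meaning every pkmap entry has been examined) it flushes the TLB.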
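
A toy model of the resume-where-you-left-off scan in map_new_virtual(). It assumes only 8 slots and returns None where the kernel would sleep; the counter convention (0 = free, 1 = unused but not yet flushed, 2 = one user) follows the usual description of pkmap_count, and the class and method names are made up for the sketch.

```python
LAST_PKMAP = 8   # assumption: tiny table for illustration


class Pkmap:
    def __init__(self):
        self.count = [0] * LAST_PKMAP
        self.last_nr = 0          # where the previous scan stopped
        self.tlb_flushes = 0

    def map_new_virtual(self):
        budget = LAST_PKMAP
        while True:
            self.last_nr = (self.last_nr + 1) % LAST_PKMAP
            if self.last_nr == 0:
                # Wrapped: every entry has been looked at, so recycle
                # the stale ones and flush the TLB once for all of them.
                self._flush_all_zero_pkmaps()
                budget = LAST_PKMAP
            if self.count[self.last_nr] == 0:
                self.count[self.last_nr] = 2   # mapped, one user
                return self.last_nr
            budget -= 1
            if budget == 0:
                return None                    # the kernel would sleep here

    def kunmap(self, nr: int) -> None:
        self.count[nr] -= 1                    # 2 -> 1: unused, not yet flushed

    def _flush_all_zero_pkmaps(self) -> None:
        for i, c in enumerate(self.count):
            if c == 1:
                self.count[i] = 0              # now really free
        self.tlb_flushes += 1
```

An unmapped slot is not reused immediately; it only becomes free at the next wrap-around, which batches the TLB flushes.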

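    http://lxr.linux.no/#linux+v3.1.6/arch/arm/include/asm/pgtable.h#L102 describes the page tables on ARM and is well worth studying carefully.

    kmap_atomic is implemented differently on each architecture, and the slots each CPU reserves for atomic kmaps are defined and limited differently, so before using it be sure that an atomic mapping is really needed.
    About bounce buffers

In recent kernels the bio and bounce concepts have been merged, so the interfaces differ from those in 2.4.

See: http://lxr.linux.no/#linux+v3.1.6/mm/bounce.c

The concepts have changed too; see biodoc.txt. Bouncing now applies only to devices that cannot perform I/O on the pages in question, which sidesteps the performance problem with bounce buffers described below. Device drivers are responsible for declaring what they can handle when it comes to high memory.

Bounce buffers hurt I/O performance, so some patches exist to bypass them, for example http://tldp.org/HOWTO/IO-Perf-HOWTO/overview.html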

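Chapter 10: Page Frame Reclamation

    The kernel uses LRU lists to manage pages other than those managed by the slab allocator.
    The kernel's LRU differs from textbook LRU (it uses whether a page has been accessed as its notion of age). Pages are split into an active list and an inactive list, and are moved between the two lists or freed according to certain criteria.
    For more, see the relevant parts of ULK and Professional Linux Kernel Architecture.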
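
A heavily simplified sketch of the two-list idea: pages carry a referenced bit instead of a timestamp, a second access promotes a page from the inactive list to the active list, and reclaim takes the oldest unreferenced inactive page. The promotion and demotion rules here are assumptions chosen to keep the example small, not the kernel's exact policy.

```python
from collections import OrderedDict


class TwoListLRU:
    def __init__(self):
        self.active = OrderedDict()     # page -> referenced bit
        self.inactive = OrderedDict()

    def add(self, page) -> None:
        self.inactive[page] = False     # new pages start on the inactive list

    def touch(self, page) -> None:
        if page in self.inactive:
            if self.inactive[page]:
                del self.inactive[page]
                self.active[page] = True     # second access: promote
            else:
                self.inactive[page] = True   # first access: just mark it
        elif page in self.active:
            self.active[page] = True

    def refill_inactive(self, n: int) -> None:
        """Age up to n of the oldest active pages."""
        for _ in range(min(n, len(self.active))):
            page, referenced = self.active.popitem(last=False)
            if referenced:
                self.active[page] = False    # recently used: another round
            else:
                self.inactive[page] = False  # demote

    def reclaim(self):
        """Free the oldest inactive page that has not been referenced."""
        for _ in range(len(self.inactive)):
            page, referenced = self.inactive.popitem(last=False)
            if not referenced:
                return page
            self.active[page] = False        # it was used: keep it around
        return None
```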

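Chapter 11: Swap Management

    Process-private pages and anonymous pages are managed differently from what the earlier chapters describe: they have to interact with a swap device in order to achieve better performance.

The swap cache is important for shared pages.

swap_info_struct describes each swap area. For the meaning of its members, read the comments in the code; the book's description is only a rough guide.

The number of swap areas the system supports is fixed by the size of the swap_info array. Without memory migration and hardware poison configured, the default is 32 swap areas. Besides this array, a generic list, swap_list, also chains the areas together.

The case for supporting more swap areas comes down to certain special scenarios: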

This would imply supporting 64 swap areas is not worth the additional complexity but there are cases where a large number of swap areas would be desirable even if the overall swap available does not increase. Some modern machines have many separate disks which between them can create a large number of separate block devices. In this case, it is desirable to create a large number of small swap areas which are evenly distributed across all disks. This would allow a high degree of parallelism in the page swapping behaviour which is important for swap intensive applications.

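    Each swap area is divided into page-sized slots. By convention the first slot holds overall information about the swap area, so it must not be used or overwritten. It is described by swap_header, a union that supports both the new and the old swap formats.

This structure has barely changed between 2.4 and 3.0 :)

    How a page table entry maps to a swap entry

The PTE stores the index of the page's swap area in the swap_info array together with the offset into the corresponding swap_map. The architecture-independent swp_entry_t holds this data and the flag bits. Two inline functions convert between a PTE and a swp_entry_t, and each architecture supplies its own PTE helpers for converting to and from the architecture-independent swp_entry_t, as shown at http://lxr.linux.no/#linux+v3.1.6/include/linux/swapops.h#L62

(Figure: swap entry layout, using x86 as the example.)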
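
The architecture-independent packing can be sketched as a type field in the top bits and an offset in the rest. The sketch assumes a 64-bit unsigned long and 5 bits for the swap area index (32 areas); the Python function names mirror the kernel helpers but this is a model of the layout, not the kernel code.

```python
BITS_PER_LONG = 64         # assumption: 64-bit machine
MAX_SWAPFILES_SHIFT = 5    # 2**5 = 32 swap areas
SWP_TYPE_SHIFT = BITS_PER_LONG - MAX_SWAPFILES_SHIFT
SWP_OFFSET_MASK = (1 << SWP_TYPE_SHIFT) - 1


def swp_entry(swap_type: int, offset: int) -> int:
    """Pack a swap area index and a slot offset into one word."""
    return (swap_type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK)


def swp_type(entry: int) -> int:
    """Index of the swap area in the swap_info array."""
    return entry >> SWP_TYPE_SHIFT


def swp_offset(entry: int) -> int:
    """Slot offset within that swap area's swap_map."""
    return entry & SWP_OFFSET_MASK
```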

    swap_info_struct→swap_map tracks every page slot in a swap area.

Linux tries to group pages on disk into clusters of SWAPFILE_CLUSTER pages to improve performance. For the rationale, see the book's description:
Linux attempts to organise pages into clusters on disk of size SWAPFILE_CLUSTER. It allocates SWAPFILE_CLUSTER number of pages sequentially in swap keeping count of the number of sequentially allocated pages in swap_info_struct→cluster_nr and records the current offset in swap_info_struct→cluster_next. Once a sequential block has been allocated, it searches for a block of free entries of size SWAPFILE_CLUSTER. If a block large enough can be found, it will be used as another cluster sized sequence.

If no free clusters large enough can be found in the swap area, a simple first-free search starting from swap_info_struct→lowest_bit is performed. The aim is to have pages swapped out at the same time close together on the premise that pages swapped out together are related. This premise, which seems strange at first glance, is quite solid when it is considered that the page replacement algorithm will use swap space most when linearly scanning the process address space swapping out pages. Without scanning for large free blocks and using them, it is likely that the scanning would degenerate to first-free searches and never improve. With it, processes exiting are likely to free up large blocks of slots.
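
The allocation order described in the two quoted paragraphs can be modelled roughly as: keep filling the current cluster, else look for a whole free cluster, else fall back to the first free slot. The sketch below assumes a cluster size of 4 (far smaller than the kernel's) and always scans from slot 1 instead of tracking lowest_bit; SwapArea and its methods are hypothetical names.

```python
SWAPFILE_CLUSTER = 4   # assumption: small value so the behaviour is visible


class SwapArea:
    def __init__(self, nslots: int):
        self.map = [0] * nslots
        self.map[0] = 1            # slot 0 holds the header, never handed out
        self.cluster_next = 1      # next slot in the current cluster
        self.cluster_nr = 0        # slots still to allocate in this cluster

    def _take(self, offset: int, remaining: int) -> int:
        self.map[offset] = 1
        self.cluster_next = offset + 1
        self.cluster_nr = remaining
        return offset

    def alloc(self):
        # 1. Continue the current cluster if its next slot is still free.
        if self.cluster_nr > 0:
            off = self.cluster_next
            if off < len(self.map) and self.map[off] == 0:
                return self._take(off, self.cluster_nr - 1)
        # 2. Look for a run of SWAPFILE_CLUSTER free slots.
        run = 0
        for off in range(1, len(self.map)):
            run = run + 1 if self.map[off] == 0 else 0
            if run == SWAPFILE_CLUSTER:
                return self._take(off - SWAPFILE_CLUSTER + 1,
                                  SWAPFILE_CLUSTER - 1)
        # 3. Fall back to a simple first-free search.
        for off in range(1, len(self.map)):
            if self.map[off] == 0:
                return self._take(off, 0)
        return None

    def free(self, offset: int) -> None:
        self.map[offset] = 0
```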

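    Swapping out pages shared by several processes needs special care, which is why the swap cache, designed for shared pages, exists. It differs from the ordinary page cache in two main ways:
        Pages in the swap cache always use swapper_space as the address_space in page->mapping, whereas page->mapping for an ordinary page cache page points to the address_space of its backing file or device.
        Pages are added to the swap cache with the dedicated add_to_swap_cache() rather than the generic add_to_page_cache().

Note also that the other class, anonymous pages, goes through the swap cache as well when being considered for swap-out.

    How a page is read in from backing storage

The entry point is read_swap_cache_async(), which is usually called when a page fault occurs.

How a page is written to backing storage
By convention, this is handled by the writepage callback in swap_aops, which is registered in swapper_space.

Reading and writing blocks in the swap area
The book's description no longer matches 2.6; this work is now done in the swap_aops callbacks.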

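Chapter 12: Shared Memory Virtual Filesystem

    Shared memory

Calling mmap() with MAP_SHARED on a file or device maps it as shared memory.

Calling mmap() with MAP_SHARED on an anonymous region with no file or device backing it also produces shared memory; this memory is then shared between parent and child across fork().

The remaining case is explicitly setting up a shared memory segment with shmget()/shmat().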
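
The second case (anonymous MAP_SHARED memory shared across fork()) is easy to try from user space. The sketch below uses Python's mmap and os modules and only runs on Unix-like systems; share_with_child is a made-up helper name.

```python
import mmap
import os


def share_with_child(message: bytes) -> bytes:
    """Let a forked child write into anonymous shared memory; read it back."""
    # fileno -1 requests an anonymous mapping; MAP_SHARED makes writes
    # visible to every process that inherits the mapping.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:
        buf[:len(message)] = message   # child writes
        os._exit(0)
    os.waitpid(pid, 0)                 # parent waits for the child to finish
    data = buf[:len(message)]
    buf.close()
    return data
```

With MAP_PRIVATE instead of MAP_SHARED the child's write would land in its own copy and the parent would read back zero bytes.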

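    For shared anonymous pages that have no file or device behind them, the kernel keeps its interfaces consistent by backing them with a RAM-based filesystem, so that they too appear to be backed by a "file".

shm and tmpfs
Both start at init_tmpfs(). shm is mounted and initialised by the kernel during boot, whereas tmpfs usually has to be mounted by the system administrator (done by the boot scripts after the kernel has started).

    shmem_inode_info is the kernel's inode-level descriptor for shared memory (distinct from the inodes of regular files and device files).
    This structure changed considerably between 2.4 and 3.0; consult the comments in the latest code when working with it.
    In keeping with the standard VFS layout, shmem_ops (for operating on shm files) and shmem_vm_ops (for operating on anonymous VMAs) are defined along with their callbacks.
    The standard sets of inode_operations structures and callbacks are provided (inode, dir, symlink, ...).

All of these are at http://lxr.linux.no/#linux+v3.1.6/mm/shmem.c#L2286

The file operations tmpfs provides are all built on the shmem implementation.
A tmpfs filesystem is also mounted at /dev/shm as a means of inter-process communication; see tmpfs.txt.

    How a virtual file handles a page fault
    If the page has been swapped out, it is located through the filesystem-private information in the inode rather than through the information in the PTE.

The key function is shmem_getpage_gfp().
The direct and indirect index blocks in shmem_inode_info that the book describes are no longer used in recent kernels; see https://lkml.org/lkml/2011/6/14/134 and http://lwn.net/Articles/447378/. That patchset was a major rework: it gives shmem better 64-bit support, introduces the radix tree, and so on.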

Chapter 13: Out of Memory Management

    A handful of statistics are used to decide whether enough memory is available:

Total page cache as page cache is easily reclaimed

Total free pages because they are already available

Total free swap pages as userspace pages may be paged out

Total pages managed by swapper_space although this double-counts the free swap pages. This is balanced by the fact that slots are sometimes reserved but not used

Total pages used by the dentry cache as they are easily reclaimed

Total pages used by the inode cache as they are easily reclaimed
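
The check amounts to summing the counters listed above and comparing the total with the request. A minimal sketch, assuming all counters are already expressed in pages; the dictionary keys and the function signature are made up for illustration and the real kernel heuristic has more cases.

```python
def vm_enough_memory(requested_pages: int, stats: dict) -> bool:
    """Rough 'is there room for this allocation' heuristic."""
    available = (
        stats["page_cache"]       # easily reclaimed
        + stats["free_pages"]     # already available
        + stats["free_swap"]      # userspace pages may be paged out
        + stats["swap_cache"]     # pages managed by swapper_space
        + stats["dentry_cache"]   # easily reclaimed
        + stats["inode_cache"]    # easily reclaimed
    )
    return available > requested_pages
```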

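    If the demand still cannot be met after some effort, the key out_of_memory() logic takes over (each architecture has some special handling of its own); see the good article I excerpted.

    How to choose a victim is the hard question.
    For the implementation, see ULK and Professional Linux Kernel Architecture.
    Since 2.6 there is a new VM_ACCOUNT mechanism for accounting the space available to VMAs.

Security support is also provided: some kernel APIs can be overridden by implementations supplied by a security module; see security/security.c.

Finally, the series of appendices gives brief introductions and annotations for the key functions of each chapter.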