Linux开发心得总结14 - linux情景分析--存储管理

来源：互联网发布：ios 模仿淘宝商品详情编辑：程序博客网时间：2024/05/22 13:01

2.1 linux内存管理基本框架

linux中的分段分页机制分三层，页目录（PGD），中间目录(PMD)，页表(PT)。PT中的表项称为页表项（PTE）。注意英文缩写，在linux程序中函数变量的名字等都会和英文缩写相关。

LINUX中的三级映射流程如图：

但是arm结构的MMU在硬件只有2级映射，所以在软件上会跳过PMD表。即：在PGD中直接放的是PT的base address。在linux软件上就是：

[cpp] view plaincopyprint?

#define PMD_SHIFT 21
#define PGDIR_SHIFT 21 //让PMD_SHIFT 和 PGDIR_SHIFT 相等就可以了。

#define PMD_SHIFT21#define PGDIR_SHIFT21             //让PMD_SHIFT 和 PGDIR_SHIFT 相等就可以了。

新的2.6内核和内核源码情景分析上的差别挺大的，在2.6.11版本以后，linux将软件上的3级映射变成了4级映射，在PMD后面增加了一个PUD(page upper directory). 在arm的两级映射中，跳过PMD和PUD

在2.6.39内核arch/arm/include/asm/pgtable.h中有下代码：

[cpp] view plaincopyprint?

/*
* Hardware-wise, we have a two level page table structure, where the first
* level has 4096 entries, and the second level has 256 entries. Each entry
* is one 32-bit word. Most of the bits in the second level entry are used
* by hardware, and there aren't any "accessed" and "dirty" bits.
*
* Linux on the other hand has a three level page table structure, which can
* be wrapped to fit a two level page table structure easily - using the PGD
* and PTE only. However, Linux also expects one "PTE" table per page, and
* at least a "dirty" bit.
*
* Therefore, we tweak the implementation slightly - we tell Linux that we
* have 2048 entries in the first level, each of which is 8 bytes (iow, two
* hardware pointers to the second level.) The second level contains two
* hardware PTE tables arranged contiguously, preceded by Linux versions
* which contain the state information Linux needs. We, therefore, end up
* with 512 entries in the "PTE" level.
*
* This leads to the page tables having the following layout:
*
* pgd pte
* | |
* +--------+
* | | +------------+ +0
* +- - - - + | Linux pt 0 |
* | | +------------+ +1024
* +--------+ +0 | Linux pt 1 |
* | |-----> +------------+ +2048
* +- - - - + +4 | h/w pt 0 |
* | |-----> +------------+ +3072
* +--------+ +8 | h/w pt 1 |
* | | +------------+ +4096
*
* See L_PTE_xxx below for definitions of bits in the "Linux pt", and
* PTE_xxx for definitions of bits appearing in the "h/w pt".
*
* PMD_xxx definitions refer to bits in the first level page table.
*
* The "dirty" bit is emulated by only granting hardware write permission
* iff the page is marked "writable" and "dirty" in the Linux PTE. This
* means that a write to a clean page will cause a permission fault, and
* the Linux MM layer will mark the page dirty via handle_pte_fault().
* For the hardware to notice the permission change, the TLB entry must
* be flushed, and ptep_set_access_flags() does that for us.
*
* The "accessed" or "young" bit is emulated by a similar method; we only
* allow accesses to the page if the "young" bit is set. Accesses to the
* page will cause a fault, and handle_pte_fault() will set the young bit
* for us as long as the page is marked present in the corresponding Linux
* PTE entry. Again, ptep_set_access_flags() will ensure that the TLB is
* up to date.
*
* However, when the "young" bit is cleared, we deny access to the page
* by clearing the hardware PTE. Currently Linux does not flush the TLB
* for us in this case, which means the TLB will retain the transation
* until either the TLB entry is evicted under pressure, or a context
* switch which changes the user space mapping occurs.
*/

/* * Hardware-wise, we have a two level page table structure, where the first * level has 4096 entries, and the second level has 256 entries.  Each entry * is one 32-bit word.  Most of the bits in the second level entry are used * by hardware, and there aren't any "accessed" and "dirty" bits. * * Linux on the other hand has a three level page table structure, which can * be wrapped to fit a two level page table structure easily - using the PGD * and PTE only.  However, Linux also expects one "PTE" table per page, and * at least a "dirty" bit. * * Therefore, we tweak the implementation slightly - we tell Linux that we * have 2048 entries in the first level, each of which is 8 bytes (iow, two * hardware pointers to the second level.)  The second level contains two * hardware PTE tables arranged contiguously, preceded by Linux versions * which contain the state information Linux needs.  We, therefore, end up * with 512 entries in the "PTE" level. * * This leads to the page tables having the following layout: * *    pgd             pte * |        | * +--------+ * |        |       +------------+ +0 * +- - - - +       | Linux pt 0 | * |        |       +------------+ +1024 * +--------+ +0    | Linux pt 1 | * |        |-----> +------------+ +2048 * +- - - - + +4    |  h/w pt 0  | * |        |-----> +------------+ +3072 * +--------+ +8    |  h/w pt 1  | * |        |       +------------+ +4096 * * See L_PTE_xxx below for definitions of bits in the "Linux pt", and * PTE_xxx for definitions of bits appearing in the "h/w pt". * * PMD_xxx definitions refer to bits in the first level page table. * * The "dirty" bit is emulated by only granting hardware write permission * iff the page is marked "writable" and "dirty" in the Linux PTE.  This * means that a write to a clean page will cause a permission fault, and * the Linux MM layer will mark the page dirty via handle_pte_fault(). * For the hardware to notice the permission change, the TLB entry must * be flushed, and ptep_set_access_flags() does that for us. * * The "accessed" or "young" bit is emulated by a similar method; we only * allow accesses to the page if the "young" bit is set.  Accesses to the * page will cause a fault, and handle_pte_fault() will set the young bit * for us as long as the page is marked present in the corresponding Linux * PTE entry.  Again, ptep_set_access_flags() will ensure that the TLB is * up to date. * * However, when the "young" bit is cleared, we deny access to the page * by clearing the hardware PTE.  Currently Linux does not flush the TLB * for us in this case, which means the TLB will retain the transation * until either the TLB entry is evicted under pressure, or a context * switch which changes the user space mapping occurs. */

[cpp] view plaincopyprint?

#define PTRS_PER_PTE 512 //PTE的个数
#define PTRS_PER_PMD 1
#define PTRS_PER_PGD 2048#define PTE_HWTABLE_PTRS (PTRS_PER_PTE)
#define PTE_HWTABLE_OFF (PTE_HWTABLE_PTRS * sizeof(pte_t))
#define PTE_HWTABLE_SIZE (PTRS_PER_PTE * sizeof(u32))/*
* PMD_SHIFT determines the size of the area a second-level page table can map
* PGDIR_SHIFT determines what a third-level page table entry can map
*/
#define PMD_SHIFT 21
#define PGDIR_SHIFT 21 //另PMD和PDGIR相等，来跳过PMD。

<p>#define PTRS_PER_PTE  512                              //PTE的个数#define PTRS_PER_PMD  1#define PTRS_PER_PGD  2048</p><p>#define PTE_HWTABLE_PTRS (PTRS_PER_PTE)#define PTE_HWTABLE_OFF  (PTE_HWTABLE_PTRS * sizeof(pte_t))#define PTE_HWTABLE_SIZE (PTRS_PER_PTE * sizeof(u32))</p><p>/* * PMD_SHIFT determines the size of the area a second-level page table can map * PGDIR_SHIFT determines what a third-level page table entry can map */#define PMD_SHIFT  21#define PGDIR_SHIFT  21 //另PMD和PDGIR相等，来跳过PMD。</p>

/*linux将PGD为2k，每项为8个byte。(MMU中取值为前高12bit，为4k，每项4个byte，但linux为什么要这样做呢？) ，另外linux在定义pte时，定义了两个pte，一个供MMU使用，一个供linux使用，来用描述这个页。
*根据注释中的表图，我看到有 #define PTRS_PER_PTE 512 #define PTRS_PER_PGD 2048 ， linux将PDG定义为2K,8byte,每个pte项为512，4byte。他将两个pte项进行了一下合并。为什么？为什么？

**/

在进程中，传说是可以看到4G的空间，按照linux的用户空间和内核空间划分。其实进程可以看到3G的自己进程的空间，3G-4G的空间是内核空间，进程仅能通过系统调用进入。

2.2地址映射全过程

将x86的段地址，略。。。

2.3几个重要的数据结构和函数

PGD,PTE的值定义：

[cpp] view plaincopyprint?

/*
* These are used to make use of C type-checking..
*/
typedef struct { pteval_t pte; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pgd[2]; } pgd_t; //定义一个[2]数组，这样就和上面的介绍对应起来了，每个PGD是一个8个byte的值（即两个long型数）。
typedef struct { unsigned long pgprot; } pgprot_t;

/* * These are used to make use of C type-checking.. */typedef struct { pteval_t pte; } pte_t;typedef struct { unsigned long pmd; } pmd_t;typedef struct { unsigned long pgd[2]; } pgd_t; //定义一个[2]数组，这样就和上面的介绍对应起来了，每个PGD是一个8个byte的值（即两个long型数）。typedef struct { unsigned long pgprot; } pgprot_t;

这几个结构体定义PTE等他们的结构， pgprot_t 是page protect的意思，最上面的图可知，在具体映射时候，需要将PTE表中的高20bit的值 + 线性地址的低 12bit的值才是具体的物理地址。所以PTE只有高20bit是在地址映射映射时有效的，那么他的低12bit用来放一些protect的数据，如writeable，rdonly......下面是pgprot_t的一些值：

[cpp] view plaincopyprint?

#define __PAGE_NONE __pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN)
#define __PAGE_SHARED __pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_XN)
#define __PAGE_SHARED_EXEC __pgprot(_L_PTE_DEFAULT | L_PTE_USER)
#define __PAGE_COPY __pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY | L_PTE_XN)
#define __PAGE_COPY_EXEC __pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY)
#define __PAGE_READONLY __pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY | L_PTE_XN)
#define __PAGE_READONLY_EXEC __pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY)

#define __PAGE_NONE__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN)#define __PAGE_SHARED__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_XN)#define __PAGE_SHARED_EXEC__pgprot(_L_PTE_DEFAULT | L_PTE_USER)#define __PAGE_COPY__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY | L_PTE_XN)#define __PAGE_COPY_EXEC__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY)#define __PAGE_READONLY__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY | L_PTE_XN)#define __PAGE_READONLY_EXEC__pgprot(_L_PTE_DEFAULT | L_PTE_USER | L_PTE_RDONLY)

所以当我们想做一个PTE时（即生成一个page的pte），调用下列函数：

[cpp] view plaincopyprint?

#define mk_pte(page,prot) pfn_pte(page_to_pfn(page), prot)

#define mk_pte(page,prot)pfn_pte(page_to_pfn(page), prot)

这个函数即是 page_to_pfn(page)得到高20bit的值 + prot的值。 prot即为pgprot_t的结构型值，低12bit。表示页的属性。

生成这一个pte后，我们要让这个pte生效，就要在将这个值赋值到对应的地方去。

set_pte（pteptr，pteval）：但是在2.6.39没有找到这个代码，过程应该差不多了，算了·····

另外，我们可以通过对PTE的低12bit设置，来让mmu判断，是否建立了映射？还是建立了映射但已经被swap出去了？

还有其他很多的函数，和宏定义，具体可以看《深入linux内核》的内存寻址这章。

这里都是说的PTE中的低12bit的，但是在2.6.39中明显多了一个linux pt，具体做什么用还不太清楚。但是linux在arch/arm/include/asm/pgtable.h 中由个注释

[cpp] view plaincopyprint?

/*
* "Linux" PTE definitions.
*
* We keep two sets of PTEs - the hardware and the linux version.
* This allows greater flexibility in the way we map the Linux bits
* onto the hardware tables, and allows us to have YOUNG and DIRTY
* bits.
*
* The PTE table pointer refers to the hardware entries; the "Linux"
* entries are stored 1024 bytes below.
*/

/* * "Linux" PTE definitions. * * We keep two sets of PTEs - the hardware and the linux version. * This allows greater flexibility in the way we map the Linux bits * onto the hardware tables, and allows us to have YOUNG and DIRTY * bits. * * The PTE table pointer refers to the hardware entries; the "Linux" * entries are stored 1024 bytes below. */

应该可以看出，linux pt和PTE的低12bit是相呼应的，用来记录page的一些信息。

linux中，每个物理页都有一个对应的page结构体，组成一个数组，放在mem_map中。

[cpp] view plaincopyprint?

/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
*/
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
union {
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
*/
struct { /* SLUB */
u16 inuse;
u16 objects;
};
};
union {
struct {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
};
#if USE_SPLIT_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
};
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
*/
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
unsigned long debug_flags; /* Use atomic bitops on this */
#endif
#ifdef CONFIG_KMEMCHECK
/*
* kmemcheck wants to track the status of each byte in a page; this
* is a pointer to such a status block. NULL if not tracked.
*/
void *shadow;
#endif
};

/* * Each physical page in the system has a struct page associated with * it to keep track of whatever it is we are using the page for at the * moment. Note that we have no way to track which tasks are using * a page, though if it is a pagecache page, rmap structures can tell us * who is mapping it. */struct page {unsigned long flags;/* Atomic flags, some possibly * updated asynchronously */atomic_t _count;/* Usage count, see below. */union {atomic_t _mapcount;/* Count of ptes mapped in mms, * to show when page is mapped * & limit reverse map searches. */struct {/* SLUB */u16 inuse;u16 objects;};};union {    struct {unsigned long private;/* Mapping-private opaque data:  * usually used for buffer_heads * if PagePrivate set; used for * swp_entry_t if PageSwapCache; * indicates order in the buddy * system if PG_buddy is set. */struct address_space *mapping;/* If low bit clear, points to * inode address_space, or NULL. * If page mapped as anonymous * memory, low bit is set, and * it points to anon_vma object: * see PAGE_MAPPING_ANON below. */    };#if USE_SPLIT_PTLOCKS    spinlock_t ptl;#endif    struct kmem_cache *slab;/* SLUB: Pointer to slab */    struct page *first_page;/* Compound tail pages */};union {pgoff_t index;/* Our offset within mapping. */void *freelist;/* SLUB: freelist req. slab lock */};struct list_head lru;/* Pageout list, eg. active_list * protected by zone->lru_lock ! *//* * On machines where all RAM is mapped into kernel address space, * we can simply calculate the virtual address. On machines with * highmem some memory is mapped into kernel virtual memory * dynamically, so we need a place to store that address. * Note that this field could be 16 bits on x86 ... ;) * * Architectures with slow multiplication can define * WANT_PAGE_VIRTUAL in asm/page.h */#if defined(WANT_PAGE_VIRTUAL)void *virtual;/* Kernel virtual address (NULL if   not kmapped, ie. highmem) */#endif /* WANT_PAGE_VIRTUAL */#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGSunsigned long debug_flags;/* Use atomic bitops on this */#endif#ifdef CONFIG_KMEMCHECK/* * kmemcheck wants to track the status of each byte in a page; this * is a pointer to such a status block. NULL if not tracked. */void *shadow;#endif};

页的page结构放在mem_map中。要想找到一个物理地址对应的page结构很简单，就是 mem_map[pfn]即可。

由于硬件的关系，会将整个内存分成不同的zone，：ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. 这样分的原因是：有些芯片的DMA设置只能设置到固定区域的地址，所以将这些区域作为ZONE_DMA以免其他操作占用。NORMAL就是正常的内存。高位内存内核空间因为只有1G不能全部映射，所以也要做特殊的处理，进行映射，才能使用。

这三个ZONE使用的描述符是：

[cpp] view plaincopyprint?

struct zone {
/* Fields commonly accessed by the page allocator */
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
/*
* free areas of different sizes
*/
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
#endif
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct zone_lru {
struct list_head list;
} lru[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
unsigned int inactive_ratio;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* zone_start_pfn, spanned_pages and present_pages are all
* protected by span_seqlock. It is a seqlock because it has
* to be read outside of zone->lock, and it is done in the main
* allocator path. But, it is written quite infrequently.
*
* The lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*/
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
/*
* rarely used fields:
*/
const char *name;
} ____cacheline_internodealigned_in_smp;

struct zone {/* Fields commonly accessed by the page allocator *//* zone watermarks, access with *_wmark_pages(zone) macros */unsigned long watermark[NR_WMARK];/* * When free pages are below this point, additional steps are taken * when reading the number of free pages to avoid per-cpu counter * drift allowing watermarks to be breached */unsigned long percpu_drift_mark;/* * We don't know if the memory that we're going to allocate will be freeable * or/and it will be released eventually, so to avoid totally wasting several * GB of ram we must reserve some of the lower zone memory (otherwise we risk * to run OOM on the lower zones despite there's tons of freeable ram * on the higher zones). This array is recalculated at runtime if the * sysctl_lowmem_reserve_ratio sysctl changes. */unsigned longlowmem_reserve[MAX_NR_ZONES];#ifdef CONFIG_NUMAint node;/* * zone reclaim becomes active if more unmapped pages exist. */unsigned longmin_unmapped_pages;unsigned longmin_slab_pages;#endifstruct per_cpu_pageset __percpu *pageset;/* * free areas of different sizes */spinlock_tlock;int                     all_unreclaimable; /* All pages pinned */#ifdef CONFIG_MEMORY_HOTPLUG/* see spanned/present_pages for more description */seqlock_tspan_seqlock;#endifstruct free_areafree_area[MAX_ORDER];#ifndef CONFIG_SPARSEMEM/* * Flags for a pageblock_nr_pages block. See pageblock-flags.h. * In SPARSEMEM, this map is stored in struct mem_section */unsigned long*pageblock_flags;#endif /* CONFIG_SPARSEMEM */#ifdef CONFIG_COMPACTION/* * On compaction failure, 1<<compact_defer_shift compactions * are skipped before trying again. The number attempted since * last failure is tracked with compact_considered. */unsigned intcompact_considered;unsigned intcompact_defer_shift;#endifZONE_PADDING(_pad1_)/* Fields commonly accessed by the page reclaim scanner */spinlock_tlru_lock;struct zone_lru {struct list_head list;} lru[NR_LRU_LISTS];struct zone_reclaim_stat reclaim_stat;unsigned longpages_scanned;   /* since last reclaim */unsigned longflags;   /* zone flags, see below *//* Zone statistics */atomic_long_tvm_stat[NR_VM_ZONE_STAT_ITEMS];/* * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on * this zone's LRU.  Maintained by the pageout code. */unsigned int inactive_ratio;ZONE_PADDING(_pad2_)/* Rarely used or read-mostly fields *//* * wait_table-- the array holding the hash table * wait_table_hash_nr_entries-- the size of the hash table array * wait_table_bits-- wait_table_size == (1 << wait_table_bits) * * The purpose of all these is to keep track of the people * waiting for a page to become available and make them * runnable again when possible. The trouble is that this * consumes a lot of space, especially when so few things * wait on pages at a given time. So instead of using * per-page waitqueues, we use a waitqueue hash table. * * The bucket discipline is to sleep on the same queue when * colliding and wake all in that wait queue when removing. * When something wakes, it must check to be sure its page is * truly available, a la thundering herd. The cost of a * collision is great, but given the expected load of the * table, they should be so rare as to be outweighed by the * benefits from the saved space. * * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the * primary users of these fields, and in mm/page_alloc.c * free_area_init_core() performs the initialization of them. */wait_queue_head_t* wait_table;unsigned longwait_table_hash_nr_entries;unsigned longwait_table_bits;/* * Discontig memory support fields. */struct pglist_data*zone_pgdat;/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */unsigned longzone_start_pfn;/* * zone_start_pfn, spanned_pages and present_pages are all * protected by span_seqlock.  It is a seqlock because it has * to be read outside of zone->lock, and it is done in the main * allocator path.  But, it is written quite infrequently. * * The lock is declared along with zone->lock because it is * frequently read in proximity to zone->lock.  It's good to * give them a chance of being in the same cacheline. */unsigned longspanned_pages;/* total size, including holes */unsigned longpresent_pages;/* amount of memory (excluding holes) *//* * rarely used fields: */const char*name;} ____cacheline_internodealigned_in_smp;

在这个描述符中，有一个struct free_area free_area[MAX_ORDER]; 这样的成员变量，这个数组的每个元素都包含一个page的list，那么，一共有MAX_ORDER个list。他是将所有的free page按照连续是否进行分组。有2个连续的page的，有4个连续的page的，有8个连续的page的，....2^MAX_ORDER。这样将page进行分类。当需要alloc一些page时，可以方便的从这些list中找连续的page。（slub中也会像这样来分类，但是slub中分类是按byte为单位，如2byte，4byte....）

另外还有一个

[cpp] view plaincopyprint?

struct zone_lru {
struct list_head list;
} lru[NR_LRU_LISTS];

struct zone_lru {struct list_head list;} lru[NR_LRU_LISTS];

[cpp] view plaincopyprint?

/*
* We do arithmetic on the LRU lists in various places in the code,
* so it is important to keep the active lists LRU_ACTIVE higher in
* the array than the corresponding inactive lists, and to keep
* the *_FILE lists LRU_FILE higher than the corresponding _ANON lists.
*
* This has to be kept in sync with the statistics in zone_stat_item
* above and the descriptions in vmstat_text in mm/vmstat.c
*/
#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE,
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};

<p>/* * We do arithmetic on the LRU lists in various places in the code, * so it is important to keep the active lists LRU_ACTIVE higher in * the array than the corresponding inactive lists, and to keep * the *_FILE lists LRU_FILE higher than the corresponding _ANON lists. * * This has to be kept in sync with the statistics in zone_stat_item * above and the descriptions in vmstat_text in mm/vmstat.c */#define LRU_BASE 0#define LRU_ACTIVE 1#define LRU_FILE 2</p><p>enum lru_list { LRU_INACTIVE_ANON = LRU_BASE, LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE, LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE, LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, LRU_UNEVICTABLE, NR_LRU_LISTS};</p>

这样的成员变量，这也是一个list数组，其中NR_LRU_LISTS表示在enum 中定义的成员的个数（新的linux中好像有不少这样的用法），这个list数组中，每个list也是很多page结构体。LRU是一中算法，会将长时间不同的page给swap out到flash中。这个数组是LRU算法用的。其中有 inactive的page， active的page， inactive 的file，等//// 在后面对LRU介绍。

UMA和NUMA

在linux中，如果CPU访问所有的memory所需的时间都是一样的，那么我们人为这个系统是UMA(uniform memory architecture)，但是如果访问memory所需的时间是不一样的（在SMP（多核系统）是很常见的），那么我们人为这个系统是NUMA(Non-uniform memory architecture)。在linux中，系统会将内存分成几个node，每个node中的memory，CPU进入的时间的相等的。这个我们分配内存的时候就会从一个node进行分配。

[cpp] view plaincopyprint?

typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct page_cgroup *node_page_cgroup;
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
struct bootmem_data *bdata;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/*
* Must be held any time you expect node_start_pfn, node_present_pages
* or node_spanned_pages stay constant. Holding this will also
* guarantee that any pfn_valid() stays that way.
*
* Nests above zone->lock and zone->size_seqlock.
*/
spinlock_t node_size_lock;
#endif
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
} pg_data_t;

typedef struct pglist_data {struct zone node_zones[MAX_NR_ZONES];struct zonelist node_zonelists[MAX_ZONELISTS];int nr_zones;#ifdef CONFIG_FLAT_NODE_MEM_MAP/* means !SPARSEMEM */struct page *node_mem_map;#ifdef CONFIG_CGROUP_MEM_RES_CTLRstruct page_cgroup *node_page_cgroup;#endif#endif#ifndef CONFIG_NO_BOOTMEMstruct bootmem_data *bdata;#endif#ifdef CONFIG_MEMORY_HOTPLUG/* * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant.  Holding this will also * guarantee that any pfn_valid() stays that way. * * Nests above zone->lock and zone->size_seqlock. */spinlock_t node_size_lock;#endifunsigned long node_start_pfn;unsigned long node_present_pages; /* total number of physical pages */unsigned long node_spanned_pages; /* total size of physical page     range, including holes */int node_id;wait_queue_head_t kswapd_wait;struct task_struct *kswapd;int kswapd_max_order;enum zone_type classzone_idx;} pg_data_t;

在pglist_data结构中，一个成员变量node_zones[], 他代表了这个node下的所有zone。组成一个数组。

另一个成员变量node_zonelist[]. 这是一个list数组，每一个list中将不同node下的所有的zone都link到一起。数组中不同元素list中，zone的排列顺序不一样。这样当内核需要malloc一些page的时候，可能当前的这个node中并没有足够的page，那么就会按照这个数组list中的顺序依次去申请空间。内核在不同的情况下，需要按照不同的顺序申请空间。所以需要好几个不同的list，这些list就组成了这个数组。

每一个pglist_data结构对应一个node。这样，在每个zone结构上又多了一个node结构。这样内存页管理的结构应该是

node：同过UMA和NUMA将内存分成几个node。（在arm系统中，如果不是smp的一般都是一个node）

zone：在每个node中，再将内存分配成几个zone。

page：在每个zone中，对page进行管理，添加都各种list中。

for example：可以分析一下 for_each_zone这个宏定义，会发现就是先查找每个node，在每个node下进行zone的扫描。

上面的这些都是描述物理空间的，page，zone，node都是物理空间的管理结构，下面的结构体，描述虚拟空间

从物理空间来看，物理空间主要是“供”，他是实实在在存在的，主要目的就是向OS提供空间。

从虚拟空间来看，虚拟空间是“需”，他是虚拟的，主要是就发送需求。

*****当虚拟空间提出了 “需求”，但物理空间无法满足时候，就会进行swap out操作了。

vm_area_struct结构：这个结构描述进程的的内存空间的情况。

[cpp] view plaincopyprint?

/*
* This struct defines a memory VMM memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
* space that has a special rule for the page-fault handlers (ie a shared
* library, the executable area etc).
*/
struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;
pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, see mm.h. */
struct rb_node vm_rb;
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
struct raw_prio_tree_node prio_tree_node;
} shared;
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
struct list_head anon_vma_chain; /* Serialized by mmap_sem &
* page_table_lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */
#ifndef CONFIG_MMU
struct vm_region *vm_region; /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
};

/* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task.  A VM area is any part of the process virtual memory * space that has a special rule for the page-fault handlers (ie a shared * library, the executable area etc). */struct vm_area_struct {struct mm_struct * vm_mm;/* The address space we belong to. */unsigned long vm_start;/* Our start address within vm_mm. */unsigned long vm_end;/* The first byte after our end address   within vm_mm. *//* linked list of VM areas per task, sorted by address */struct vm_area_struct *vm_next, *vm_prev;pgprot_t vm_page_prot;/* Access permissions of this VMA. */unsigned long vm_flags;/* Flags, see mm.h. */struct rb_node vm_rb;/* * For areas with an address space and backing store, * linkage into the address_space->i_mmap prio tree, or * linkage to the list of like vmas hanging off its node, or * linkage of vma in the address_space->i_mmap_nonlinear list. */union {struct {struct list_head list;void *parent;/* aligns with prio_tree_node parent */struct vm_area_struct *head;} vm_set;struct raw_prio_tree_node prio_tree_node;} shared;/* * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma * list, after a COW of one of the file pages.A MAP_SHARED vma * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack * or brk vma (with NULL file) can only be in an anon_vma list. */struct list_head anon_vma_chain; /* Serialized by mmap_sem &  * page_table_lock */struct anon_vma *anon_vma;/* Serialized by page_table_lock *//* Function pointers to deal with this struct. */const struct vm_operations_struct *vm_ops;/* Information about our backing store: */unsigned long vm_pgoff;/* Offset (within vm_file) in PAGE_SIZE   units, *not* PAGE_CACHE_SIZE */struct file * vm_file;/* File we map to (can be NULL). */void * vm_private_data;/* was vm_pte (shared mem) */unsigned long vm_truncate_count;/* truncate_count or restart_addr */#ifndef CONFIG_MMUstruct vm_region *vm_region;/* NOMMU mapping region */#endif#ifdef CONFIG_NUMAstruct mempolicy *vm_policy;/* NUMA policy for the VMA */#endif};

这个数据结构在程序的变量名字常常是vma。

这个结构是成员变量， vm_start 和vm_end表示一段连续的虚拟区间。但是并不是一个连续的虚拟区间就可以用一个vm_area_struct结构来表示。而是要虚拟地址连续，并且这段空间的属性/访问权限等也相同才可以。

所以这段空间的属于和权限用

[cpp] view plaincopyprint?

pgprot_t vm_page_prot; /* Access permissions of this VMA. */
unsigned long vm_flags; /* Flags, see mm.h. */

pgprot_t vm_page_prot;  /* Access permissions of this VMA. */ unsigned long vm_flags;  /* Flags, see mm.h. */

这两个成员变量来表示。看到这两个成员变量可算看到亲人了。还记得那个大明湖畔的容嬷嬷妈？-------pgprot_t vm_page_prot；

每一个进程的所有vm_erea_struct结构通过

[cpp] view plaincopyprint?

/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;

/* linked list of VM areas per task, sorted by address */struct vm_area_struct *vm_next, *vm_prev;

这个指针link起来。

当然，有上面这种链表的链接方式，查找起来是很麻烦的，而且一个进程中可能会有好多个这样的结构，所以除了上面的链表，还要有一中更有效的查找方式：

[cpp] view plaincopyprint?

/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;
struct raw_prio_tree_node prio_tree_node;
} shared;

<span style="font-size:13px;">/* * For areas with an address space and backing store, * linkage into the address_space->i_mmap prio tree, or * linkage to the list of like vmas hanging off its node, or * linkage of vma in the address_space->i_mmap_nonlinear list. */union {struct {struct list_head list;void *parent;/* aligns with prio_tree_node parent */struct vm_area_struct *head;} vm_set;struct raw_prio_tree_node prio_tree_node;} shared;</span>

有两种情况虚拟空间会与磁盘flash发生关系。

1. 当内存分配失败，没有足够的内存时候，会发生swap。

2. linux进行mmap系统时候，会将磁盘上的内存map到用户空间，像访问memory一样，直接访问文件内容。

为了迎合1. vm_area_struct 的mapping，vm_next_share,vm_pprev_share,vm_file等。但是在2.6.39中没找到这些变量。仅有：（在page结构体中，也有swap相关的信息）

[cpp] view plaincopyprint?

/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */

/* Information about our backing store: */unsigned long vm_pgoff;/* Offset (within vm_file) in PAGE_SIZE   units, *not* PAGE_CACHE_SIZE */struct file * vm_file;/* File we map to (can be NULL). */void * vm_private_data;/* was vm_pte (shared mem) */unsigned long vm_truncate_count;/* truncate_count or restart_addr */

在后面再分析这些结构。

为了迎合2. vm_area_struct结构提供了

[cpp] view plaincopyprint?

/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;

/* Function pointers to deal with this struct. */const struct vm_operations_struct *vm_ops;

这样的变量，进行操作。

在vm_area_struct结构中，有个mm_struct结构，注释说他属于这个结构，由此看来，mm_struct结构应该是vm_area_struct结构的上层。

[cpp] view plaincopyprint?

struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
struct vm_area_struct * mmap_cache; /* last find_vma result */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
#endif
unsigned long mmap_base; /* base of mmap area */
unsigned long task_size; /* size of task vm space */
unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
int map_count; /* number of VMAs */
spinlock_t page_table_lock; /* Protects page tables and some counters */
struct rw_semaphore mmap_sem;
struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
* together off init_mm.mmlist, and are protected
* by mmlist_lock
*/
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
unsigned long total_vm, locked_vm, shared_vm, exec_vm;
unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
/*
* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
struct mm_rss_stat rss_stat;
struct linux_binfmt *binfmt;
cpumask_t cpu_vm_mask;
/* Architecture-specific MM context */
mm_context_t context;
/* Swap token stuff */
/*
* Last value of global fault stamp as seen by this process.
* In other words, this value gives an indication of how long
* it has been since this task got the token.
* Look at mm/thrash.c
*/
unsigned int faultstamp;
unsigned int token_priority;
unsigned int last_interval;
/* How many tasks sharing this mm are OOM_DISABLE */
atomic_t oom_disable_count;
unsigned long flags; /* Must use atomic bitops to access the bits */
struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
/*
* "owner" points to a task that is regarded as the canonical
* user/owner of this mm. All of the following must be true in
* order for it to be changed:
*
* current == mm->owner
* current->mm != mm
* new_owner->mm == mm
* new_owner->alloc_lock is held
*/
struct task_struct __rcu *owner;
#endif
#ifdef CONFIG_PROC_FS
/* store ref to file /proc/<pid>/exe symlink points to */
struct file *exe_file;
unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
};

struct mm_struct {struct vm_area_struct * mmap;/* list of VMAs */struct rb_root mm_rb;struct vm_area_struct * mmap_cache;/* last find_vma result */#ifdef CONFIG_MMUunsigned long (*get_unmapped_area) (struct file *filp,unsigned long addr, unsigned long len,unsigned long pgoff, unsigned long flags);void (*unmap_area) (struct mm_struct *mm, unsigned long addr);#endifunsigned long mmap_base;/* base of mmap area */unsigned long task_size;/* size of task vm space */unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */unsigned long free_area_cache;/* first hole of size cached_hole_size or larger */pgd_t * pgd;atomic_t mm_users;/* How many users with user space? */atomic_t mm_count;/* How many references to "struct mm_struct" (users count as 1) */int map_count;/* number of VMAs */spinlock_t page_table_lock;/* Protects page tables and some counters */struct rw_semaphore mmap_sem;struct list_head mmlist;/* List of maybe swapped mm's.These are globally strung * together off init_mm.mmlist, and are protected * by mmlist_lock */unsigned long hiwater_rss;/* High-watermark of RSS usage */unsigned long hiwater_vm;/* High-water virtual memory usage */unsigned long total_vm, locked_vm, shared_vm, exec_vm;unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;unsigned long start_code, end_code, start_data, end_data;unsigned long start_brk, brk, start_stack;unsigned long arg_start, arg_end, env_start, env_end;unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv *//* * Special counters, in some configurations protected by the * page_table_lock, in other configurations by being atomic. */struct mm_rss_stat rss_stat;struct linux_binfmt *binfmt;cpumask_t cpu_vm_mask;/* Architecture-specific MM context */mm_context_t context;/* Swap token stuff *//* * Last value of global fault stamp as seen by this process. * In other words, this value gives an indication of how long * it has been since this task got the token. * Look at mm/thrash.c */unsigned int faultstamp;unsigned int token_priority;unsigned int last_interval;/* How many tasks sharing this mm are OOM_DISABLE */atomic_t oom_disable_count;unsigned long flags; /* Must use atomic bitops to access the bits */struct core_state *core_state; /* coredumping support */#ifdef CONFIG_AIOspinlock_tioctx_lock;struct hlist_headioctx_list;#endif#ifdef CONFIG_MM_OWNER/* * "owner" points to a task that is regarded as the canonical * user/owner of this mm. All of the following must be true in * order for it to be changed: * * current == mm->owner * current->mm != mm * new_owner->mm == mm * new_owner->alloc_lock is held */struct task_struct __rcu *owner;#endif#ifdef CONFIG_PROC_FS/* store ref to file /proc/<pid>/exe symlink points to */struct file *exe_file;unsigned long num_exe_file_vmas;#endif#ifdef CONFIG_MMU_NOTIFIERstruct mmu_notifier_mm *mmu_notifier_mm;#endif#ifdef CONFIG_TRANSPARENT_HUGEPAGEpgtable_t pmd_huge_pte; /* protected by page_table_lock */#endif};

这个结构在代码中，经常是mm的名字。

他比vm_area_struct结构更上层，每个进程只有一个mm_struct结构在task_struct中由指针。因此mm_struct结构会更具总结型。

所以虚拟空间的联系图为：

总结一下：

讲解了，

1.PGD,PTE,

2.node--->zone---->page

3. mm_struct------>vm_area_struct

虚拟地址和物理地址通过 PGD,PTE联系起来。

2.4越界访问

linux中的虚拟地址通过PGD，PTE等映射到物理地址。但当这个映射过程无法正常映射时候，就会报错，产生page fault exception。那么什么时候会无法正常呢？

编程错误。程序使用了不存在的地址
不是编程错误，linux的请求调页机制。即：当进程运行时，linux并不将全部的资源分配给进程，而是仅分配当前需要的这一部分，当进程需要另外的资源的时候（这时候就会产生缺页异常），linux再分配这部分。

编程错误linux肯定不会手软的，直接弄死进程。请求调页机制，linux会申请页的。

当MMU对不存在的虚拟地址进行映射的时候，会产生异常，__dabt_usr: __dabt_svc。他们都会调用do_DataAbort。

[cpp] view plaincopyprint?

__dabt_usr:
usr_entry
kuser_cmpxchg_check
@
@ Call the processor-specific abort handler:
@
@ r2 - aborted context pc
@ r3 - aborted context cpsr
@
@ The abort handler must return the aborted address in r0, and
@ the fault status register in r1. //说明了调用do_DataAbort时候传递给他的参数。 r0，r1，
@
#ifdef MULTI_DABORT
ldr r4, .LCprocfns
mov lr, pc
ldr pc, [r4, #PROCESSOR_DABT_FUNC]
#else
bl CPU_DABORT_HANDLER
#endif
@
@ IRQs on, then call the main handler
@
enable_irq
mov r2, sp //参数r2
adr lr, ret_from_exception
b do_DataAbort //调用do_DataAbort
ENDPROC(__dabt_usr)

__dabt_usr:usr_entrykuser_cmpxchg_check@@ Call the processor-specific abort handler:@@ r2 - aborted context pc@ r3 - aborted context cpsr@@ The abort handler must return the aborted address in r0, and@ the fault status register in r1. //说明了调用do_DataAbort时候传递给他的参数。 r0，r1，@#ifdef MULTI_DABORTldrr4, .LCprocfnsmovlr, pcldrpc, [r4, #PROCESSOR_DABT_FUNC]#elseblCPU_DABORT_HANDLER#endif@@ IRQs on, then call the main handler@enable_irqmovr2, sp //参数r2adrlr, ret_from_exceptionbdo_DataAbort //调用do_DataAbortENDPROC(__dabt_usr)

do_DataAbor根据川进来的参数调用 inf->fn ，在这里就是 do_page_fault

[cpp] view plaincopyprint?

asmlinkage void __exception
do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
const struct fsr_info *inf = fsr_info + (fsr & 15) + ((fsr & (1 << 10)) >> 6);
struct siginfo info;
if (!inf->fn(addr, fsr, regs))
return;
printk(KERN_ALERT "Unhandled fault: %s (0x%03x) at 0x%08lx\n",
inf->name, fsr, addr);
info.si_signo = inf->sig;
info.si_errno = 0;
info.si_code = inf->code;
info.si_addr = (void __user *)addr;
arm_notify_die("", regs, &info, fsr, 0);
}

asmlinkage void __exceptiondo_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs){const struct fsr_info *inf = fsr_info + (fsr & 15) + ((fsr & (1 << 10)) >> 6);struct siginfo info;if (!inf->fn(addr, fsr, regs))return;printk(KERN_ALERT "Unhandled fault: %s (0x%03x) at 0x%08lx\n",inf->name, fsr, addr);info.si_signo = inf->sig;info.si_errno = 0;info.si_code = inf->code;info.si_addr = (void __user *)addr;arm_notify_die("", regs, &info, fsr, 0);}

do_page_fault函数是页异常的重要函数：

ps：因为在，__dabt_usr: __dabt_svc这两种情况下都会调用do_DataAbort函数，然后调用do_page_fault，所以调用do_page_fault可能是在内核空间，也可能是在用户空间。在内核空间可能会是进程通过系统调用，中断等进入的，也可能进程本来是内核线程。所以在do_page_fault中要行进判断的。到底是从哪里发生的异常。

一下函数的分析大部分参考了《深入linux内核》书中的第9章。

[cpp] view plaincopyprint?

static int __kprobes //__kprobes标志应该是调试的一种东西，等以后再专门研究一下。
do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
if (notify_page_fault(regs, fsr)) //这个函数是专门和上面的__kprobes对应的，调试的东西
return 0;
tsk = current;
mm = tsk->mm; //将出现页异常的进程的的进程描述符赋给tsk，内存描述符赋给mm
/*
* If we're in an interrupt or have no user
* context, we must not take the fault..
*/
if (in_atomic() || !mm) //判断发生发生异常的，in_atomic判断是否是在原子操作中：中断程序，可延迟函数，禁用内核抢占的临界区。！mm 判断进程是否是内核线程。参照博客线程调度的文章。
goto no_context; //如果缺页是发生在这些情况下，那么就要特殊处理，因为这些程序都是没有用户空间的，要特殊处理。
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
* we can bug out early if this is from code which shouldn't.
*/
if (!down_read_trylock(&mm->mmap_sem)) {
if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
goto no_context;
down_read(&mm->mmap_sem);
} //down sem，没啥说的。
fault = __do_page_fault(mm, addr, fsr, tsk); //下面分析。
up_read(&mm->mmap_sem);
/*
* Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
*/
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
return 0; //如果返回值不是上面的值。那么就是MAJOR MINOR，说明问题解决了，return，如果是，那么还要go on
/*
* If we are in kernel mode at this point, we
* have no context to handle this fault with.
*/
if (!user_mode(regs))
goto no_context; //如果是内核空间出现了页异常，并且通过__do_page_fault没有没有解决，那么到on_context
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, or some other thing
* happened to us that made us unable to handle
* the page fault gracefully.
*/
printk("VM: killing process %s\n", tsk->comm);
do_group_exit(SIGKILL);
return 0;
}
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to
* successfully fix up this page fault.
*/
sig = SIGBUS;
code = BUS_ADRERR;
} else {
/*
* Something tried to access memory that
* isn't in our memory map..
*/
sig = SIGSEGV;
code = fault == VM_FAULT_BADACCESS ?
SEGV_ACCERR : SEGV_MAPERR;
} //这上面的英文描述很清楚了
__do_user_fault(tsk, addr, fsr, sig, code, regs); //用户态错误，这个函数什么都不做，就是发个新号，弄死进程
return 0;
no_context:
__do_kernel_fault(mm, addr, fsr, regs); //内核错误，这个函数什么都不干，发送OOP：：：啊啊啊啊啊啊，这个警告曾经弄死多少好汉
return 0;
}

static int __kprobes //__kprobes标志应该是调试的一种东西，等以后再专门研究一下。do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs){struct task_struct *tsk;struct mm_struct *mm;int fault, sig, code;if (notify_page_fault(regs, fsr)) //这个函数是专门和上面的__kprobes对应的，调试的东西return 0;tsk = current;mm = tsk->mm; //将出现页异常的进程的的进程描述符赋给tsk，内存描述符赋给mm /* * If we're in an interrupt or have no user * context, we must not take the fault.. */if (in_atomic() || !mm) //判断发生发生异常的，in_atomic判断是否是在原子操作中：中断程序，可延迟函数，禁用内核抢占的临界区。！mm 判断进程是否是内核线程。参照博客线程调度的文章。goto no_context; //如果缺页是发生在这些情况下，那么就要特殊处理，因为这些程序都是没有用户空间的，要特殊处理。 /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, * we can bug out early if this is from code which shouldn't. */if (!down_read_trylock(&mm->mmap_sem)) {if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))goto no_context;down_read(&mm->mmap_sem);} //down sem，没啥说的。fault = __do_page_fault(mm, addr, fsr, tsk); //下面分析。up_read(&mm->mmap_sem);/* * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR */if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))return 0; //如果返回值不是上面的值。那么就是MAJOR MINOR，说明问题解决了，return，如果是，那么还要go on/* * If we are in kernel mode at this point, we * have no context to handle this fault with. */if (!user_mode(regs))goto no_context; //如果是内核空间出现了页异常，并且通过__do_page_fault没有没有解决，那么到on_contextif (fault & VM_FAULT_OOM) {/* * We ran out of memory, or some other thing * happened to us that made us unable to handle * the page fault gracefully. */printk("VM: killing process %s\n", tsk->comm);do_group_exit(SIGKILL);return 0;}if (fault & VM_FAULT_SIGBUS) {/* * We had some memory, but were unable to * successfully fix up this page fault. */sig = SIGBUS;code = BUS_ADRERR;} else {/* * Something tried to access memory that * isn't in our memory map.. */sig = SIGSEGV;code = fault == VM_FAULT_BADACCESS ?SEGV_ACCERR : SEGV_MAPERR;} //这上面的英文描述很清楚了__do_user_fault(tsk, addr, fsr, sig, code, regs); //用户态错误，这个函数什么都不做，就是发个新号，弄死进程return 0;no_context:__do_kernel_fault(mm, addr, fsr, regs); //内核错误，这个函数什么都不干，发送OOP：：：啊啊啊啊啊啊，这个警告曾经弄死多少好汉return 0;}

[cpp] view plaincopyprint?

static int
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
struct task_struct *tsk)
{
struct vm_area_struct *vma;
int fault, mask;
vma = find_vma(mm, addr); //find_vma函数是从mm结构中找到一个vm_area_struct结构，这个结构的vm_end值 > 参数addr，但是vm_start可能大于也可能小于，但fing_vma会尽可能找到小于的。即：addr在vma这个区间中。
fault = VM_FAULT_BADMAP;
if (!vma) //如果vma为0，那么想当于所有vma的vm_end地址都 < 参数addr，这个产生异常错误的地址肯定是无效地址了。因为根据进程的结构，如上图，所有vm中vm_end的最大值是3G。
goto out;
if (vma->vm_start > addr) //如果vm_start>addr，说明这个异常的地址是图中的那个空洞中，那么则可能是在用户态的栈中出错的。则跳去 check_stack
goto check_stack;
/*
* Ok, we have a good vm_area for this
* memory access, so we can handle it.
*///如果上面条件都不是，那么得到的vma结构就包含 addr这个地址。有可能是malloc后出错的。用户态下的malloc时候，其实并不分配物理空间，只是返回一个虚拟地址，并产生一个vm_area_struct结构，当真正要用到malloc的空间的时候，才会产生异常，在这里得到真正的物理空间。当然也有可能是其他原因，比如在read only的时候进行write了。
good_area:
if (fsr & (1 << 11)) /* write? */ //fsr从第一段的汇编中得出意义，他是r1.状态。这里看地址异常时候是写还是其他。并复制到mask中
mask = VM_WRITE;
else
mask = VM_READ|VM_EXEC|VM_WRITE;
fault = VM_FAULT_BADACCESS;
if (!(vma->vm_flags & mask))
goto out; //如果是写，但是vma->vm_flag写没有set 1，说明的确是权限的问题，那么就set fault的值，退出。
/*
* If for any reason at all we couldn't handle
* the fault, make sure we exit gracefully rather
* than endlessly redo the fault.
*/
survive: //上面原因都不是，那么就给进程分配一个新的页框。成功返回VM_FAULT_MAJOR（在handle_mm_fault中得到一个页框时候出现了阻塞，进行了睡眠）/VM_FAULT_MINOR（没有睡眠）
fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, fsr & (1 << 11));
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM) //返回OOM，没有足够的内存。到out_of_memory中，sleep一会，retry。
goto out_of_memory;
else if (fault & VM_FAULT_SIGBUS)
return fault;
BUG();
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
else
tsk->min_flt++;
return fault;
out_of_memory:
if (!is_global_init(tsk))
goto out;
/*
* If we are out of memory for pid1, sleep for a while and retry
*/
up_read(&mm->mmap_sem);
yield();
down_read(&mm->mmap_sem);
goto survive;
check_stack:
if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr)) //检查堆栈，进行expand，扩展原来的栈的vma，然后goto good_area，去得到一个实际的物理地址。
goto good_area;
out:
return fault;
}

static int__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,struct task_struct *tsk){struct vm_area_struct *vma;int fault, mask;vma = find_vma(mm, addr); //find_vma函数是从mm结构中找到一个vm_area_struct结构，这个结构的vm_end值 > 参数addr，但是vm_start可能大于也可能小于，但fing_vma会尽可能找到小于的。即：addr在vma这个区间中。fault = VM_FAULT_BADMAP;if (!vma) //如果vma为0，那么想当于所有vma的vm_end地址都 < 参数addr，这个产生异常错误的地址肯定是无效地址了。因为根据进程的结构，如上图，所有vm中vm_end的最大值是3G。goto out;if (vma->vm_start > addr) //如果vm_start>addr，说明这个异常的地址是图中的那个空洞中，那么则可能是在用户态的栈中出错的。则跳去 check_stackgoto check_stack;/* * Ok, we have a good vm_area for this * memory access, so we can handle it. *///如果上面条件都不是，那么得到的vma结构就包含 addr这个地址。有可能是malloc后出错的。用户态下的malloc时候，其实并不分配物理空间，只是返回一个虚拟地址，并产生一个vm_area_struct结构，当真正要用到malloc的空间的时候，才会产生异常，在这里得到真正的物理空间。当然也有可能是其他原因，比如在read only的时候进行write了。good_area:if (fsr & (1 << 11)) /* write? */ //fsr从第一段的汇编中得出意义，他是r1.状态。这里看地址异常时候是写还是其他。并复制到mask中mask = VM_WRITE;elsemask = VM_READ|VM_EXEC|VM_WRITE;fault = VM_FAULT_BADACCESS;if (!(vma->vm_flags & mask))goto out; //如果是写，但是vma->vm_flag写没有set 1，说明的确是权限的问题，那么就set fault的值，退出。/* * If for any reason at all we couldn't handle * the fault, make sure we exit gracefully rather * than endlessly redo the fault. */survive: //上面原因都不是，那么就给进程分配一个新的页框。成功返回VM_FAULT_MAJOR（在handle_mm_fault中得到一个页框时候出现了阻塞，进行了睡眠）/VM_FAULT_MINOR（没有睡眠）fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, fsr & (1 << 11));if (unlikely(fault & VM_FAULT_ERROR)) {if (fault & VM_FAULT_OOM) //返回OOM，没有足够的内存。到out_of_memory中，sleep一会，retry。goto out_of_memory;else if (fault & VM_FAULT_SIGBUS)return fault;BUG();}if (fault & VM_FAULT_MAJOR)tsk->maj_flt++;elsetsk->min_flt++;return fault;out_of_memory:if (!is_global_init(tsk))goto out;/* * If we are out of memory for pid1, sleep for a while and retry */up_read(&mm->mmap_sem);yield();down_read(&mm->mmap_sem);goto survive;check_stack:if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr)) //检查堆栈，进行expand，扩展原来的栈的vma，然后goto good_area，去得到一个实际的物理地址。goto good_area;out:return fault;}

进程的用户空间结构:

图上面的堆栈空间个人感觉不对，堆是堆，栈是栈，准确的说应该是栈吧、、。。堆会在brk（）函数中设置的。

这样分析之后基本上2.4和2.5节的内容已经全包含了，下面总结扩展并补充一下下：

当发生页面异常时候，会产生中断，当在用户态时候，会产生，__dabt_usr: 在内核态时__dabt_svc。但他们都会调用到do_DataAbort函数，do_DataAbort会根据中断的寄存器调用相应的处理函数。这里产生页面中断时候，会调用do_page_fault函数。
在do_page_fault函数中，首先会检查中断发生时，是不是在临界区，或者中断中，或者内核线程中。如果是的话，那么就产生个OOPs。因为这些地方是不允许异常的。如果异常会产生阻塞，阻塞就会死锁。死锁程序员就会被弄，被弄了程序员就发过来弄linux，所以linux就先弄了程序员，发出OOPs错误。
如果不在临界区中，那么调用__do_page_fault函数。这个函数会先检查产生异常的地址，是不是进程的已有的线性空间中，即检查vm_area_struct的list中。
如果在的话，检查是不是因为权限的问题产生的异常，如果是，那么说明应用程序是有问题的，直接弄死他。如果不是，有可能是写时复制等一些linux机制。调用handle_mm_fault函数，进行分页等。
如果不在vm_area_struc的list中，如果大于所有的vm_area_struct的vm_end，那么说明是错误的地址，也是直接弄死。如果有小于<vm_end，也小于vm_start，那么说明是在空洞中，应该是栈的问题，去申请更多的栈空间。（为什么<vm_end&&<vm_start就是在栈中，因为malloc等都是事先分配个vm_area_struct结构，当异常时，会找到相应的vm_area_struct结构的。如果找不到，那就是在栈里面了溢出了）。

下面再说handle_mm_fault函数：

他会首先检查是否已经存在了PTE等映射，如果不在alloc 所有的 PUD,PMD,PTE等，建立映射。然后调用handle_pte_fault函数。注意：由于有些在刚建立的pte，所有pte里面全是0，有些是以前是建立好的，所以里面pte里面不是0，可能有其他值。所以下面还会做判断。
handle_pte_fault函数会判断具体缺页的类型，具体分为3类，会根据这三类调用不同的函数。具体调用的函数，以后再说吧。。。。。
1.这个页从来没有被访问过，也就是这个pte中全是0，pte_none这个宏返回1
2.以前访问过这个页，但这个页是非线性磁盘文件的映射，即：dirty位置1， pte_file返回1
3. 以前访问过这个页，但内容已经被保存在磁盘上了，即：dirty位 0

下图是 understand linux kernel书中的一个图，中文图在中文书中的P378

2.6 物理页面的使用和周转

2.8 页面的定期换出

2.9 页面的换入

这三节都是讲的页框的回收和释放内容。

具体代码太难太难，简要介绍原理，尽可能是解释代码吧。

（参考书籍：《深入理解linux内核》17章《Professional Linux Kernel Architecture》Chapter 18）

深入理解linux内核》讲的2.6内核，但有点老，有些函数对不上。《professional linux kernel architecture》是新版（2.6.24）的linux介绍。但是于2.6.39相比也有点对不上。

页框回收算法（page frame reclaiming algorithm，PFRA）

页框回收算法，是回收一些物理内存页。在两种情况下会触发这个算法，

1，当进行alloc_page时等申请内存时，但内存中空闲的page已经不够了。就会触发PFRA去回收一些页。得到更多的free的page。

2，如果仅仅在第1个条件才触发PFRA，那alloc_page函数就会等待很长时间。所以，linux会创建一个kswapd内核线程。会定期的被唤醒去调用PFRA算法，尽量去保证linux有较多的物理内存空间。不会低于high_wmark_pages。

那么去回收那些页呢？PFRA会将“很久”不用的页reclaim到硬盘/flash上，然后释放这个页。对于页被reclaim到硬盘/flash的哪个位置？PFRA进行了分类：不可回收页，可交换页，可同步页，可丢弃页。

不可回收页，包括：本来就是空闲的页，内核空间使用的页，临时锁定的页。内核空间的页linux认为会比较频繁，也比较重要，不会reclaim。
可同步页。如一些打开的文件，过mmap映射到内存的占用的页。这样的页本来在硬盘中就有备份。所以系统只需要查看这些页是否被修改过，如果被修改过，就将其写到硬盘上，然后reclaim，如果没有被修改过，那么直接reclaim。
可交换页。这些页是对应于这2种页的。包括：用户进行的malloc的页，或者和其他进程共享的页等。这些页在硬盘上没有对应的区域，所以要想将其写到硬盘上，就要有一个交换区（swap区），内核会将其写到这个交换区。然后再reclaim。
可丢弃页。如：slab机制保留的未使用的页，文件系统目录项等的一些缓存的页。这些页可以直接丢弃。

那么如何判断页是否“很久”没有用呢？有个LRU（）算法。

在zone这个结构体中，会有一些list：

[cpp] view plaincopyprint?

/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct zone_lru {
struct list_head list;
} lru[NR_LRU_LISTS];

/* Fields commonly accessed by the page reclaim scanner */spinlock_tlru_lock;struct zone_lru {struct list_head list;} lru[NR_LRU_LISTS];

用来连接这个zone中的各个不同状态的页。

[cpp] view plaincopyprint?

enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE,
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};

enum lru_list {LRU_INACTIVE_ANON = LRU_BASE,LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,LRU_UNEVICTABLE,NR_LRU_LISTS};

LRU_INACTIVE_ANON 和LRU_ACTIVE_ANON对应于匿名页，就是上面说的可交换的页。 LRU_INACTIVE_FILE 和 LRU_ACTIVE_FILE对应于可同步页。LRU_UNEVICTABLE应该指不可回收的页。

讲解一下上图中相关的一些知识：

首先两个方框表示两个list，一个active，一个inactive。从了在zone的这个list中可以找到active的页，也可以从page这个结构体中检查。当page结构体中的 flag这个成员变量，flag&PG_active ==1时，page对应的页是active的，如flag&PG_active ==0,则是inactive的。

在每个方框中，也有两个部分，Ref0和Ref1。他代表页是否刚被访问过。如：当一个进程访问一个页时候，就会调用mark_page_accessed函数，这个函数将这个页的page结构的flag变量 flag |= PG_reference;这样标志页刚刚被访问过。但linux的kswap进程会扫描所有页，将 flag &= ~PG_reference;标志页有一段时间没有被访问过了。这样如果在Ref0中停留过长时间的话，就会调用shrink_active_list函数将这个页移动到inactive list中。

下面介绍回收的流程：

上图就是介绍的两个shrink memory的方法。一个是当alloc_page内存不够时，直接direct page reclaim。一种是swap daemons。

当然他们shrink的强度是不一样的，direct page reclaim是紧急情况的，必须要回收到的。而swap daemons是尽量回收，保留更多的内存空间。所以在try_to_free_pages在调用shrink_zones函数中会传入个priority，这个priority代表shrink的强度，priority会不断减一，强度越来越大。直到释放了足够的alloc_page的空间。如果priority减到0了，还不能shrink足够的空间。那么就是调用out_of_memory函数，来kill 进程，释放空间。

在进行shrink_page_list中，会根据相关的页进行不同的设置。如：

如当shrink LRU_INACTIVE_ANON 里的page的时候，就会将所有对应这个页的页表（PTE）的present位和dirty位请0.但是其他位不为0. 其他位标志了这个页所在的swap区的位置（此时pte里的值已经不是页表信息了，而是表示这个页保存在磁盘的地址。具体意义可看《深入理解linux内核》的710页，换出页标识符）。

当shrink LRU_INACTIVE_FILE 里的page的时候，就会将所有对应这个页的页表（PTE）的present位请0，但dirty位置一.由于这些页在磁盘上本来就是有对应的文件的，所以不需要记录他的具体问题，linux可以找到。

所以就有了上面讨论缺页异常时候，读页的那三个函数。和判断。。。。。。

在上面提到过，将所有对应于该页的pte，由于对应于一个页，可能会有不止一个的PTE映射在这个页上，那么如果通过这个page结构体来找到对应的所有的pte呢？这个机制叫反响映射，在《深入理解linux内核》的回收页框的 674页，反向映射。