Linux Memory Management: Page Reclaim
Overview
Where there is page allocation, there must also be page reclaim. Memory is freed in one of two ways: either its owner releases it voluntarily, or the kernel reclaims the pages itself.
While memory is plentiful, the Linux kernel uses as much of it as possible as a file cache (page cache) to improve system performance. When memory becomes tight, file-cache pages are dropped, or first written back to the block device, and the underlying physical memory is freed; naturally, this costs some performance.
The kernel also moves rarely used pages out to the swap partition to free memory, a mechanism called swapping. Together, these mechanisms are referred to as page reclaim.
Page reclaim algorithms
The page replacement policies used in the Linux kernel are mainly the LRU algorithm and the second-chance algorithm.
The LRU algorithm
LRU stands for Least Recently Used: when memory runs low, the pages that were used least recently become the candidates for eviction.
The LRU implementation manages pages on linked lists, split into an active LRU list and an inactive LRU list, and pages move back and forth between the two as their usage changes.
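The kernel's real lists live in per-zone LRU structures and are manipulated under locks using page flags; the user-space sketch below only illustrates the two-list idea. All names in it (lru_page, lru_list, reclaim_one, and so on) are invented for this example: referenced pages are promoted from the inactive list back to the active list, and eviction victims are taken from the inactive tail.

#include <stdio.h>

/* Invented stand-ins for struct page and an LRU list. */
struct lru_page {
    int pfn;                    /* pretend page frame number */
    int referenced;             /* models the accessed/referenced bit */
    struct lru_page *next;
};

struct lru_list {
    struct lru_page *head;      /* head = hottest end of the list */
};

/* Insert at the head: the page just became "recently used". */
static void lru_add(struct lru_list *l, struct lru_page *p)
{
    p->next = l->head;
    l->head = p;
}

/* Detach the tail: the coldest page on the list. */
static struct lru_page *lru_take_tail(struct lru_list *l)
{
    struct lru_page **pp = &l->head;
    struct lru_page *tail;

    if (!l->head)
        return NULL;
    while ((*pp)->next)
        pp = &(*pp)->next;
    tail = *pp;
    *pp = NULL;
    return tail;
}

/*
 * Reclaim one page: scan from the inactive tail; referenced pages are
 * promoted back to the active list (with the bit cleared), unreferenced
 * pages become the eviction victim.
 */
static struct lru_page *reclaim_one(struct lru_list *active,
                                    struct lru_list *inactive)
{
    struct lru_page *p;

    while ((p = lru_take_tail(inactive)) != NULL) {
        if (p->referenced) {
            p->referenced = 0;
            lru_add(active, p);     /* promote: still in use */
        } else {
            return p;               /* evict this one */
        }
    }
    /* Inactive list empty: demote the coldest active page, caller retries. */
    if ((p = lru_take_tail(active)) != NULL)
        lru_add(inactive, p);
    return NULL;
}

int main(void)
{
    struct lru_list active = { NULL }, inactive = { NULL };
    struct lru_page pages[4] = {
        { 1, 0, NULL }, { 2, 1, NULL }, { 3, 0, NULL }, { 4, 0, NULL },
    };
    struct lru_page *victim;
    int i;

    for (i = 0; i < 4; i++)
        lru_add(&inactive, &pages[i]);

    victim = reclaim_one(&active, &inactive);   /* page 1: tail, unreferenced */
    if (victim)
        printf("evicting pfn %d\n", victim->pfn);
    return 0;
}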
The second-chance algorithm
As the description of LRU suggests, when the system is short of memory, the pages at the tail of the LRU list leave the list and are swapped out; when the system needs them again, they are placed back at the head. The weakness of this design is that the eviction decision ignores how frequently a page is used: even a frequently used page can be swapped out simply because it happens to sit at the tail of the LRU list.
The second-chance algorithm was introduced to fix this shortcoming. Victim selection proceeds exactly as in LRU, but second chance additionally maintains an accessed bit that is checked first. If the bit is 0, the page is evicted. If the bit is 1, the page gets a second chance: its accessed bit is cleared to 0 and the scan moves on to the next page. If the page is accessed again in the meantime, the bit is set back to 1. A page that is in constant use therefore keeps its accessed bit at 1 and is never evicted.
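Here is a compact user-space illustration of the policy over a tiny fixed set of page frames. Everything in it (frames, ref, choose_victim, the trace in main) is invented for the example; on real hardware, the MMU sets the accessed bit in the page-table entry on each reference, and the scan clears it.

#include <stdio.h>

#define NFRAMES 4

static int frames[NFRAMES];     /* which page occupies each frame */
static int ref[NFRAMES];        /* per-frame accessed ("reference") bit */
static int hand;                /* clock hand: next frame to examine */

/*
 * Pick a victim frame: skip frames whose accessed bit is set,
 * clearing the bit as we pass (that is their "second chance").
 */
static int choose_victim(void)
{
    for (;;) {
        if (!ref[hand]) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        ref[hand] = 0;                   /* grant a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}

static void access_page(int page)
{
    int i;

    for (i = 0; i < NFRAMES; i++) {
        if (frames[i] == page) {
            ref[i] = 1;                  /* hit: "hardware" sets the bit */
            return;
        }
    }
    i = choose_victim();                 /* miss: pick a frame to replace */
    if (frames[i] >= 0)
        printf("evict page %d from frame %d\n", frames[i], i);
    frames[i] = page;
    ref[i] = 1;
}

int main(void)
{
    int trace[] = { 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 };
    int i;

    for (i = 0; i < NFRAMES; i++)
        frames[i] = -1;                  /* -1 marks a free frame */
    for (i = 0; i < (int)(sizeof(trace) / sizeof(trace[0])); i++)
        access_page(trace[i]);
    return 0;
}

Note that when every frame's bit happens to be set, the hand clears them all and the policy degrades to plain FIFO for that pass, which is the intended behaviour of the algorithm.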
The kswapd kernel thread
The Linux kernel runs a very important kernel thread, kswapd, which reclaims pages both periodically and whenever memory runs low.
At initialization time, one kernel thread named "kswapd%d" is created for every NUMA memory node in the system.
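In kernels of this era the per-node threads are started by kswapd_run(). The sketch below is lightly abridged (boot-time checks and exact error handling trimmed), so treat the details as approximate:

/* Abridged sketch of kswapd_run(): start one "kswapd%d" thread
 * for NUMA node nid, recorded in that node's pg_data_t. */
int kswapd_run(int nid)
{
    pg_data_t *pgdat = NODE_DATA(nid);
    int ret = 0;

    if (pgdat->kswapd)
        return 0;                        /* already running on this node */

    pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
    if (IS_ERR(pgdat->kswapd)) {
        /* a failure here is only fatal at early boot */
        pr_err("Failed to start kswapd on node %d\n", nid);
        ret = PTR_ERR(pgdat->kswapd);
        pgdat->kswapd = NULL;
    }
    return ret;
}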
kswapd is woken when a zone's free pages fall below the low watermark (WMARK_LOW), as the allocator slow path below shows:
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
    struct zonelist *zonelist, enum zone_type high_zoneidx,
    nodemask_t *nodemask, struct zone *preferred_zone,
    int migratetype)
{
    const gfp_t wait = gfp_mask & __GFP_WAIT;
    struct page *page = NULL;
    int alloc_flags;
    unsigned long pages_reclaimed = 0;
    unsigned long did_some_progress;
    bool sync_migration = false;
    bool deferred_compaction = false;
    bool contended_compaction = false;

    /*
     * In the slowpath, we sanity check order to avoid ever trying to
     * reclaim >= MAX_ORDER areas which will never succeed. Callers may
     * be using allocators in order of preference for an area that is
     * too large.
     */
    if (order >= MAX_ORDER) {
        WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
        return NULL;
    }

    /*
     * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
     * __GFP_NOWARN set) should not cause reclaim since the subsystem
     * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
     * using a larger set of nodes after it has established that the
     * allowed per node queues are empty and that nodes are
     * over allocated.
     */
    if (IS_ENABLED(CONFIG_NUMA) &&
            (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
        goto nopage;

restart:
    if (!(gfp_mask & __GFP_NO_KSWAPD))
        wake_all_kswapd(order, zonelist, high_zoneidx,
                        zone_idx(preferred_zone)); /* wake the kswapd threads */

    /*
     * OK, we're below the kswapd watermark and have kicked background
     * reclaim. Now things get more complex, so set up alloc_flags according
     * to how we want to proceed.
     */
    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    /*
     * Find the true preferred zone if the allocation is unconstrained by
     * cpusets.
     */
    if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
        first_zones_zonelist(zonelist, high_zoneidx, NULL,
                    &preferred_zone);

rebalance:
    /* This is the last chance, in general, before the goto nopage. */
    page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
            high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
            preferred_zone, migratetype);
    if (page)
        goto got_pg;

    /* Allocate without watermarks if the context allows */
    if (alloc_flags & ALLOC_NO_WATERMARKS) {
        /*
         * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
         * the allocation is high priority and these type of
         * allocations are system rather than user orientated
         */
        zonelist = node_zonelist(numa_node_id(), gfp_mask);

        page = __alloc_pages_high_priority(gfp_mask, order,
                zonelist, high_zoneidx, nodemask,
                preferred_zone, migratetype);
        if (page) {
            goto got_pg;
        }
    }

    /* Atomic allocations - we can't balance anything */
    if (!wait)
        goto nopage;

    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;

    /* Avoid allocations with no watermarks from looping endlessly */
    if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
        goto nopage;

    /*
     * Try direct compaction. The first pass is asynchronous. Subsequent
     * attempts after direct reclaim are synchronous
     */
    page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
                    high_zoneidx, nodemask, alloc_flags,
                    preferred_zone, migratetype,
                    sync_migration, &contended_compaction,
                    &deferred_compaction, &did_some_progress);
    if (page)
        goto got_pg;
    sync_migration = true;

    /*
     * If compaction is deferred for high-order allocations, it is because
     * sync compaction recently failed. In this is the case and the caller
     * requested a movable allocation that does not heavily disrupt the
     * system then fail the allocation instead of entering direct reclaim.
     */
    if ((deferred_compaction || contended_compaction) &&
                        (gfp_mask & __GFP_NO_KSWAPD))
        goto nopage;

    /* Try direct reclaim and then allocating */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, zonelist,
                    high_zoneidx, nodemask, alloc_flags,
                    preferred_zone, migratetype,
                    &did_some_progress);
    if (page)
        goto got_pg;

    /*
     * If we failed to make any progress reclaiming, then we are
     * running out of options and have to consider going OOM
     */
    if (!did_some_progress) {
        if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
            if (oom_killer_disabled)
                goto nopage;
            /* Coredumps can quickly deplete all memory reserves */
            if ((current->flags & PF_DUMPCORE) &&
                !(gfp_mask & __GFP_NOFAIL))
                goto nopage;
            page = __alloc_pages_may_oom(gfp_mask, order,
                    zonelist, high_zoneidx, nodemask,
                    preferred_zone, migratetype);
            if (page)
                goto got_pg;

            if (!(gfp_mask & __GFP_NOFAIL)) {
                /*
                 * The oom killer is not called for high-order
                 * allocations that may fail, so if no progress
                 * is being made, there are no other options and
                 * retrying is unlikely to help.
                 */
                if (order > PAGE_ALLOC_COSTLY_ORDER)
                    goto nopage;
                /*
                 * The oom killer is not called for lowmem
                 * allocations to prevent needlessly killing
                 * innocent tasks.
                 */
                if (high_zoneidx < ZONE_NORMAL)
                    goto nopage;
            }

            goto restart;
        }
    }

    /* Check if we should retry the allocation */
    pages_reclaimed += did_some_progress;
    if (should_alloc_retry(gfp_mask, order, did_some_progress,
                           pages_reclaimed)) {
        /* Wait for some write requests to complete then retry */
        wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
        goto rebalance;
    } else {
        /*
         * High-order allocations do not necessarily loop after
         * direct reclaim and reclaim/compaction depends on compaction
         * being called after reclaim so call directly if necessary
         */
        page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
                        high_zoneidx, nodemask, alloc_flags,
                        preferred_zone, migratetype,
                        sync_migration, &contended_compaction,
                        &deferred_compaction, &did_some_progress);
        if (page)
            goto got_pg;
    }

nopage:
    warn_alloc_failed(gfp_mask, order, NULL);
    return page;
got_pg:
    if (kmemcheck_enabled)
        kmemcheck_pagealloc_alloc(page, order, gfp_mask);

    return page;
}
Next, let's look at how the kswapd thread reclaims memory pages.
static int kswapd(void *p)
{
    unsigned long order, new_order;
    unsigned balanced_order;
    int classzone_idx, new_classzone_idx;
    int balanced_classzone_idx;
    pg_data_t *pgdat = (pg_data_t *)p;
    struct task_struct *tsk = current;

    struct reclaim_state reclaim_state = {
        .reclaimed_slab = 0,
    };
    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

    lockdep_set_current_reclaim_state(GFP_KERNEL);

    if (!cpumask_empty(cpumask))
        set_cpus_allowed_ptr(tsk, cpumask);
    current->reclaim_state = &reclaim_state;

    /*
     * Tell the memory management that we're a "memory allocator",
     * and that if we need more memory we should get access to it
     * regardless (see "__alloc_pages()"). "kswapd" should
     * never get caught in the normal page freeing logic.
     *
     * (Kswapd normally doesn't need memory anyway, but sometimes
     * you need a small amount of memory in order to be able to
     * page out something else, and this flag essentially protects
     * us from recursively trying to free more memory as we're
     * trying to free the first piece of memory in the first place).
     */
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
    set_freezable();

    order = new_order = 0;
    balanced_order = 0;
    classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
    balanced_classzone_idx = classzone_idx;
    for ( ; ; ) {
        bool ret;

        /*
         * If the last balance_pgdat was unsuccessful it's unlikely a
         * new request of a similar or harder type will succeed soon
         * so consider going to sleep on the basis we reclaimed at
         */
        if (balanced_classzone_idx >= new_classzone_idx &&
                    balanced_order == new_order) {
            new_order = pgdat->kswapd_max_order;
            new_classzone_idx = pgdat->classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        if (order < new_order || classzone_idx > new_classzone_idx) {
            /*
             * Don't sleep if someone wants a larger 'order'
             * allocation or has tigher zone constraints
             */
            order = new_order;
            classzone_idx = new_classzone_idx;
        } else {
            kswapd_try_to_sleep(pgdat, balanced_order,
                        balanced_classzone_idx);
            order = pgdat->kswapd_max_order;
            classzone_idx = pgdat->classzone_idx;
            new_order = order;
            new_classzone_idx = classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        ret = try_to_freeze();
        if (kthread_should_stop())
            break;

        /*
         * We can speed up thawing tasks if we don't call balance_pgdat
         * after returning from the refrigerator
         */
        if (!ret) {
            trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
            balanced_classzone_idx = classzone_idx;
            /* balance_pgdat() is the main page-reclaim function */
            balanced_order = balance_pgdat(pgdat, order,
                        &balanced_classzone_idx);
        }
    }

    tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
    current->reclaim_state = NULL;
    lockdep_clear_current_reclaim_state();

    return 0;
}
The implementation of balance_pgdat:
static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
                            int *classzone_idx)
{
    bool pgdat_is_balanced = false;
    int i;
    int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
    struct reclaim_state *reclaim_state = current->reclaim_state;
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    struct scan_control sc = {
        .gfp_mask = GFP_KERNEL,
        .may_unmap = 1,
        .may_swap = 1,
        /*
         * kswapd doesn't want to be bailed out while reclaim. because
         * we want to put equal scanning pressure on each zone.
         */
        .nr_to_reclaim = ULONG_MAX,
        .order = order,
        .target_mem_cgroup = NULL,
    };
    struct shrink_control shrink = {
        .gfp_mask = sc.gfp_mask,
    };
loop_again:
    sc.priority = DEF_PRIORITY;
    sc.nr_reclaimed = 0;
    sc.may_writepage = !laptop_mode;
    count_vm_event(PAGEOUTRUN);

    do {
        unsigned long lru_pages = 0;

        /*
         * Scan in the highmem->dma direction for the highest
         * zone which needs scanning
         */
        /* walk from the highest zone down to the lowest, looking for
         * the first zone (end_zone) that is out of balance */
        for (i = pgdat->nr_zones - 1; i >= 0; i--) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (zone->all_unreclaimable &&
                sc.priority != DEF_PRIORITY)
                continue;

            /*
             * Do some background aging of the anon list, to give
             * pages a chance to be referenced before reclaiming.
             */
            age_active_anon(zone, &sc);

            /*
             * If the number of buffer_heads in the machine
             * exceeds the maximum allowed level and this node
             * has a highmem zone, force kswapd to reclaim from
             * it to relieve lowmem pressure.
             */
            if (buffer_heads_over_limit && is_highmem_idx(i)) {
                end_zone = i;
                break;
            }

            if (!zone_balanced(zone, order, 0, 0)) {
                end_zone = i;
                break;
            } else {
                /* If balanced, clear the congested flag */
                zone_clear_flag(zone, ZONE_CONGESTED);
            }
        }

        if (i < 0) {
            pgdat_is_balanced = true;
            goto out;
        }

        /* page reclaim starts at the lowest zone and works up to end_zone */
        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;

            lru_pages += zone_reclaimable_pages(zone);
        }

        /*
         * Now scan the zone in the dma->highmem direction, stopping
         * at the last zone which needs scanning.
         *
         * We do this because the page allocator works in the opposite
         * direction. This prevents the page allocator from allocating
         * pages behind kswapd's direction of progress, which would
         * cause too much scanning of the lower zones.
         */
        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;
            int nr_slab, testorder;
            unsigned long balance_gap;

            if (!populated_zone(zone))
                continue;

            if (zone->all_unreclaimable &&
                sc.priority != DEF_PRIORITY)
                continue;

            sc.nr_scanned = 0;

            nr_soft_scanned = 0;
            /*
             * Call soft limit reclaim before calling shrink_zone.
             */
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                            order, sc.gfp_mask,
                            &nr_soft_scanned);
            sc.nr_reclaimed += nr_soft_reclaimed;

            /*
             * We put equal pressure on every zone, unless
             * one zone has way too many pages free
             * already. The "too many pages" is defined
             * as the high wmark plus a "gap" where the
             * gap is either the low watermark or 1%
             * of the zone, whichever is smaller.
             */
            balance_gap = min(low_wmark_pages(zone),
                (zone->managed_pages +
                    KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
                KSWAPD_ZONE_BALANCE_GAP_RATIO);
            /*
             * Kswapd reclaims only single pages with compaction
             * enabled. Trying too hard to reclaim until contiguous
             * free pages have become available can hurt performance
             * by evicting too much useful data from memory.
             * Do not reclaim more than needed for compaction.
             */
            testorder = order;
            if (IS_ENABLED(CONFIG_COMPACTION) && order &&
                    compaction_suitable(zone, order) !=
                        COMPACT_SKIPPED)
                testorder = 0;

            if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
                !zone_balanced(zone, testorder,
                       balance_gap, end_zone)) {
                shrink_zone(zone, &sc);

                reclaim_state->reclaimed_slab = 0;
                nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
                sc.nr_reclaimed += reclaim_state->reclaimed_slab;

                if (nr_slab == 0 && !zone_reclaimable(zone))
                    zone->all_unreclaimable = 1;
            }

            /*
             * If we're getting trouble reclaiming, start doing
             * writepage even in laptop mode.
             */
            if (sc.priority < DEF_PRIORITY - 2)
                sc.may_writepage = 1;

            if (zone->all_unreclaimable) {
                if (end_zone && end_zone == i)
                    end_zone--;
                continue;
            }

            if (zone_balanced(zone, testorder, 0, end_zone))
                /*
                 * If a zone reaches its high watermark,
                 * consider it to be no longer congested. It's
                 * possible there are dirty pages backed by
                 * congested BDIs but as pressure is relieved,
                 * speculatively avoid congestion waits
                 */
                zone_clear_flag(zone, ZONE_CONGESTED);
        }

        /*
         * If the low watermark is met there is no need for processes
         * to be throttled on pfmemalloc_wait as they should not be
         * able to safely make forward progress. Wake them
         */
        if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                pfmemalloc_watermark_ok(pgdat))
            wake_up(&pgdat->pfmemalloc_wait);

        if (pgdat_balanced(pgdat, order, *classzone_idx)) {
            pgdat_is_balanced = true;
            break;      /* kswapd: all done */
        }

        /*
         * We do this so kswapd doesn't build up large priorities for
         * example when it is freeing in parallel with allocators. It
         * matches the direct reclaim path behaviour in terms of impact
         * on zone->*_priority.
         */
        if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
            break;
    } while (--sc.priority >= 0);

out:
    if (!pgdat_is_balanced) {
        cond_resched();

        try_to_freeze();

        /*
         * Fragmentation may mean that the system cannot be
         * rebalanced for high-order allocations in all zones.
         * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
         * it means the zones have been fully scanned and are still
         * not balanced. For high-order allocations, there is
         * little point trying all over again as kswapd may
         * infinite loop.
         *
         * Instead, recheck all watermarks at order-0 as they
         * are the most important. If watermarks are ok, kswapd will go
         * back to sleep. High-order users can still perform direct
         * reclaim if they wish.
         */
        if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
            order = sc.order = 0;

        goto loop_again;
    }

    /*
     * If kswapd was reclaiming at a higher order, it has the option of
     * sleeping without all zones being balanced. Before it does, it must
     * ensure that the watermarks for order-0 on *all* zones are met and
     * that the congestion flags are cleared. The congestion flag must
     * be cleared as kswapd is the only mechanism that clears the flag
     * and it is potentially going to sleep here.
     */
    if (order) {
        int zones_need_compaction = 1;

        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            /* Check if the memory needs to be defragmented. */
            if (zone_watermark_ok(zone, order,
                    low_wmark_pages(zone), *classzone_idx, 0))
                zones_need_compaction = 0;
        }

        if (zones_need_compaction)
            compact_pgdat(pgdat, order);
    }

    /*
     * Return the order we were reclaiming at so prepare_kswapd_sleep()
     * makes a decision on the order we were last reclaiming at. However,
     * if another caller entered the allocator slow path while kswapd
     * was awake, order will remain at the higher level
     */
    *classzone_idx = end_zone;
    return order;
}