Linux Memory Management: Page Reclaim


Overview

Where there is memory page allocation, there must also be page reclaim. Memory is freed in one of two ways: the owner releases it voluntarily, or the kernel reclaims the pages itself.

When memory is plentiful, the Linux kernel uses as much of it as possible as file cache (page cache) to improve system performance. When memory becomes tight, file-cache pages are either dropped or written back to the block device, and the physical memory they occupied is freed. This, of course, costs some performance.

The Linux kernel also moves rarely used pages out to the swap partition in order to free memory; this mechanism is called swapping. Together, these mechanisms are referred to as page reclaim.

Page Reclaim Algorithms

The page replacement algorithms used by the Linux kernel are mainly the LRU algorithm and the second-chance algorithm.

The LRU Algorithm

LRU is short for Least Recently Used. When memory runs low, the least recently used pages become the candidates for eviction.
The LRU implementation manages pages on linked lists, split into an active LRU list and an inactive LRU list; pages move back and forth between the two.
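
A minimal user-space sketch of the two-list idea follows (illustrative only, using simplified singly linked lists; the kernel's real implementation in mm/vmscan.c is far more involved):

#include <stdio.h>

/* Illustrative page descriptor; the real kernel uses struct page and keeps
 * the LRU lists in per-zone structures. */
struct demo_page {
    int id;
    int referenced;          /* software stand-in for the accessed bit */
    struct demo_page *next;  /* singly linked to keep the demo short */
};

struct lru_list {
    struct demo_page *head;
};

/* Add a page at the head of an LRU list. */
static void lru_add(struct lru_list *lru, struct demo_page *page)
{
    page->next = lru->head;
    lru->head = page;
}

/* Take the page at the head of an LRU list (the kernel reclaims from the
 * tail of the inactive list; head vs. tail is simplified here). */
static struct demo_page *lru_take(struct lru_list *lru)
{
    struct demo_page *page = lru->head;
    if (page)
        lru->head = page->next;
    return page;
}

int main(void)
{
    struct lru_list active = { NULL }, inactive = { NULL };
    struct demo_page pages[4] = { { 0 }, { 1 }, { 2 }, { 3 } };
    struct demo_page *p;

    /* Start every page on the inactive list. */
    for (int i = 0; i < 4; i++)
        lru_add(&inactive, &pages[i]);

    /* Pretend page 2 was touched recently. */
    pages[2].referenced = 1;

    /* Referenced pages are promoted to the active list; the rest stay
     * inactive and are the first candidates for reclaim. */
    while ((p = lru_take(&inactive)) != NULL) {
        if (p->referenced)
            lru_add(&active, p);
        else
            printf("page %d is a reclaim candidate\n", p->id);
    }
    return 0;
}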

The Second-Chance Algorithm

As can be seen from the LRU algorithm, when system memory runs short, pages at the tail of the LRU list are removed and swapped out; when the system needs one of those pages again, it is placed back at the head of the list. This design has an obvious weakness: when choosing pages to swap out, it does not consider how frequently a page is used, so even a heavily used page can be evicted simply because it happens to sit at the tail of the LRU list.

The second-chance algorithm was introduced to fix this shortcoming. Victim selection proceeds as in LRU, but the algorithm also consults an accessed (referenced) bit. If the bit is 0, the page is evicted. If the bit is 1, the page gets a second chance: its accessed bit is cleared and the next page is considered for eviction instead. If the page is accessed again during this interval, the bit is set back to 1. A page that is used frequently therefore keeps its accessed bit at 1 and is never evicted.
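
A minimal user-space sketch of the second-chance (clock) policy, assuming a fixed set of page frames with a software referenced bit (illustrative only; the kernel and hardware track the accessed bit differently):

#include <stdio.h>

#define NFRAMES 4

struct frame {
    int page;        /* which page currently occupies this frame */
    int referenced;  /* "accessed" bit, set on every access */
};

static struct frame frames[NFRAMES];
static int hand;     /* clock hand: next frame to examine */

/* Pick a victim frame using the second-chance policy. */
static int choose_victim(void)
{
    for (;;) {
        struct frame *f = &frames[hand];
        if (!f->referenced) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;            /* bit is 0: evict this frame */
        }
        f->referenced = 0;            /* bit is 1: clear it, second chance */
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void)
{
    /* Frames hold pages 0..3; pages 1 and 3 were recently accessed. */
    for (int i = 0; i < NFRAMES; i++)
        frames[i].page = i;
    frames[1].referenced = 1;
    frames[3].referenced = 1;

    printf("evict page %d\n", frames[choose_victim()].page);  /* page 0 */
    printf("evict page %d\n", frames[choose_victim()].page);  /* page 2 */
    return 0;
}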

The kswapd Kernel Thread

The Linux kernel has a very important kernel thread, kswapd, which is responsible for reclaiming pages both periodically and when memory runs low.
At initialization time, one kernel thread named "kswapd%d" is created for each NUMA memory node in the system.
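
The per-node threads are started by kswapd_run() in mm/vmscan.c. A simplified sketch of roughly what it looks like in kernels of this generation (details vary between versions):

int kswapd_run(int nid)
{
    pg_data_t *pgdat = NODE_DATA(nid);
    int ret = 0;

    if (pgdat->kswapd)
        return 0;

    /* One kernel thread per NUMA node, named "kswapd<nid>". */
    pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
    if (IS_ERR(pgdat->kswapd)) {
        /* Failure at boot time is fatal. */
        BUG_ON(system_state == SYSTEM_BOOTING);
        pr_err("Failed to start kswapd on node %d\n", nid);
        ret = PTR_ERR(pgdat->kswapd);
        pgdat->kswapd = NULL;
    }
    return ret;
}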

kswapd is woken up when the number of free pages in a zone drops below the low watermark (WMARK_LOW). The wakeup happens in the page allocator's slow path, __alloc_pages_slowpath():

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
    struct zonelist *zonelist, enum zone_type high_zoneidx,
    nodemask_t *nodemask, struct zone *preferred_zone,
    int migratetype)
{
    const gfp_t wait = gfp_mask & __GFP_WAIT;
    struct page *page = NULL;
    int alloc_flags;
    unsigned long pages_reclaimed = 0;
    unsigned long did_some_progress;
    bool sync_migration = false;
    bool deferred_compaction = false;
    bool contended_compaction = false;

    /*
     * In the slowpath, we sanity check order to avoid ever trying to
     * reclaim >= MAX_ORDER areas which will never succeed. Callers may
     * be using allocators in order of preference for an area that is
     * too large.
     */
    if (order >= MAX_ORDER) {
        WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
        return NULL;
    }

    /*
     * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
     * __GFP_NOWARN set) should not cause reclaim since the subsystem
     * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
     * using a larger set of nodes after it has established that the
     * allowed per node queues are empty and that nodes are
     * over allocated.
     */
    if (IS_ENABLED(CONFIG_NUMA) &&
            (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
        goto nopage;

restart:
    if (!(gfp_mask & __GFP_NO_KSWAPD))
        wake_all_kswapd(order, zonelist, high_zoneidx,
                        zone_idx(preferred_zone)); /* wake up the kswapd threads */

    /*
     * OK, we're below the kswapd watermark and have kicked background
     * reclaim. Now things get more complex, so set up alloc_flags according
     * to how we want to proceed.
     */
    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    /*
     * Find the true preferred zone if the allocation is unconstrained by
     * cpusets.
     */
    if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
        first_zones_zonelist(zonelist, high_zoneidx, NULL,
                    &preferred_zone);

rebalance:
    /* This is the last chance, in general, before the goto nopage. */
    page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
            high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
            preferred_zone, migratetype);
    if (page)
        goto got_pg;

    /* Allocate without watermarks if the context allows */
    if (alloc_flags & ALLOC_NO_WATERMARKS) {
        /*
         * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
         * the allocation is high priority and these type of
         * allocations are system rather than user orientated
         */
        zonelist = node_zonelist(numa_node_id(), gfp_mask);

        page = __alloc_pages_high_priority(gfp_mask, order,
                zonelist, high_zoneidx, nodemask,
                preferred_zone, migratetype);
        if (page) {
            goto got_pg;
        }
    }

    /* Atomic allocations - we can't balance anything */
    if (!wait)
        goto nopage;

    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;

    /* Avoid allocations with no watermarks from looping endlessly */
    if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
        goto nopage;

    /*
     * Try direct compaction. The first pass is asynchronous. Subsequent
     * attempts after direct reclaim are synchronous
     */
    page = __alloc_pages_direct_compact(gfp_mask, order,
                    zonelist, high_zoneidx,
                    nodemask,
                    alloc_flags, preferred_zone,
                    migratetype, sync_migration,
                    &contended_compaction,
                    &deferred_compaction,
                    &did_some_progress);
    if (page)
        goto got_pg;
    sync_migration = true;

    /*
     * If compaction is deferred for high-order allocations, it is because
     * sync compaction recently failed. In this is the case and the caller
     * requested a movable allocation that does not heavily disrupt the
     * system then fail the allocation instead of entering direct reclaim.
     */
    if ((deferred_compaction || contended_compaction) &&
                        (gfp_mask & __GFP_NO_KSWAPD))
        goto nopage;

    /* Try direct reclaim and then allocating */
    page = __alloc_pages_direct_reclaim(gfp_mask, order,
                    zonelist, high_zoneidx,
                    nodemask,
                    alloc_flags, preferred_zone,
                    migratetype, &did_some_progress);
    if (page)
        goto got_pg;

    /*
     * If we failed to make any progress reclaiming, then we are
     * running out of options and have to consider going OOM
     */
    if (!did_some_progress) {
        if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
            if (oom_killer_disabled)
                goto nopage;
            /* Coredumps can quickly deplete all memory reserves */
            if ((current->flags & PF_DUMPCORE) &&
                !(gfp_mask & __GFP_NOFAIL))
                goto nopage;
            page = __alloc_pages_may_oom(gfp_mask, order,
                    zonelist, high_zoneidx,
                    nodemask, preferred_zone,
                    migratetype);
            if (page)
                goto got_pg;

            if (!(gfp_mask & __GFP_NOFAIL)) {
                /*
                 * The oom killer is not called for high-order
                 * allocations that may fail, so if no progress
                 * is being made, there are no other options and
                 * retrying is unlikely to help.
                 */
                if (order > PAGE_ALLOC_COSTLY_ORDER)
                    goto nopage;
                /*
                 * The oom killer is not called for lowmem
                 * allocations to prevent needlessly killing
                 * innocent tasks.
                 */
                if (high_zoneidx < ZONE_NORMAL)
                    goto nopage;
            }

            goto restart;
        }
    }

    /* Check if we should retry the allocation */
    pages_reclaimed += did_some_progress;
    if (should_alloc_retry(gfp_mask, order, did_some_progress,
                        pages_reclaimed)) {
        /* Wait for some write requests to complete then retry */
        wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
        goto rebalance;
    } else {
        /*
         * High-order allocations do not necessarily loop after
         * direct reclaim and reclaim/compaction depends on compaction
         * being called after reclaim so call directly if necessary
         */
        page = __alloc_pages_direct_compact(gfp_mask, order,
                    zonelist, high_zoneidx,
                    nodemask,
                    alloc_flags, preferred_zone,
                    migratetype, sync_migration,
                    &contended_compaction,
                    &deferred_compaction,
                    &did_some_progress);
        if (page)
            goto got_pg;
    }

nopage:
    warn_alloc_failed(gfp_mask, order, NULL);
    return page;
got_pg:
    if (kmemcheck_enabled)
        kmemcheck_pagealloc_alloc(page, order, gfp_mask);

    return page;
}
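
Each zone maintains three watermarks: WMARK_MIN, WMARK_LOW and WMARK_HIGH. Roughly speaking, the allocator wakes kswapd once free pages fall below the low watermark, and kswapd keeps reclaiming until free pages climb back above the high watermark. The following user-space sketch illustrates that decision; the struct and function names are illustrative, not kernel API:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative zone state; the kernel keeps these values in struct zone. */
struct demo_zone {
    unsigned long free_pages;
    unsigned long wmark_min;
    unsigned long wmark_low;
    unsigned long wmark_high;
};

/* Allocation-side check: should background reclaim be kicked? */
static bool should_wake_kswapd(const struct demo_zone *z)
{
    return z->free_pages < z->wmark_low;
}

/* Reclaim-side check: may kswapd go back to sleep? */
static bool kswapd_may_sleep(const struct demo_zone *z)
{
    return z->free_pages >= z->wmark_high;
}

int main(void)
{
    struct demo_zone z = { .free_pages = 900,
                           .wmark_min = 500, .wmark_low = 1000, .wmark_high = 1500 };

    printf("wake kswapd: %d\n", should_wake_kswapd(&z));    /* 1: below low */
    z.free_pages = 1600;
    printf("kswapd may sleep: %d\n", kswapd_may_sleep(&z)); /* 1: above high */
    return 0;
}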

Next, let's look at how the kswapd thread reclaims memory pages.

static int kswapd(void *p)
{
    unsigned long order, new_order;
    unsigned balanced_order;
    int classzone_idx, new_classzone_idx;
    int balanced_classzone_idx;
    pg_data_t *pgdat = (pg_data_t*)p;
    struct task_struct *tsk = current;

    struct reclaim_state reclaim_state = {
        .reclaimed_slab = 0,
    };
    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

    lockdep_set_current_reclaim_state(GFP_KERNEL);

    if (!cpumask_empty(cpumask))
        set_cpus_allowed_ptr(tsk, cpumask);
    current->reclaim_state = &reclaim_state;

    /*
     * Tell the memory management that we're a "memory allocator",
     * and that if we need more memory we should get access to it
     * regardless (see "__alloc_pages()"). "kswapd" should
     * never get caught in the normal page freeing logic.
     *
     * (Kswapd normally doesn't need memory anyway, but sometimes
     * you need a small amount of memory in order to be able to
     * page out something else, and this flag essentially protects
     * us from recursively trying to free more memory as we're
     * trying to free the first piece of memory in the first place).
     */
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
    set_freezable();

    order = new_order = 0;
    balanced_order = 0;
    classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
    balanced_classzone_idx = classzone_idx;
    for ( ; ; ) {
        bool ret;

        /*
         * If the last balance_pgdat was unsuccessful it's unlikely a
         * new request of a similar or harder type will succeed soon
         * so consider going to sleep on the basis we reclaimed at
         */
        if (balanced_classzone_idx >= new_classzone_idx &&
                    balanced_order == new_order) {
            new_order = pgdat->kswapd_max_order;
            new_classzone_idx = pgdat->classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        if (order < new_order || classzone_idx > new_classzone_idx) {
            /*
             * Don't sleep if someone wants a larger 'order'
             * allocation or has tigher zone constraints
             */
            order = new_order;
            classzone_idx = new_classzone_idx;
        } else {
            kswapd_try_to_sleep(pgdat, balanced_order,
                        balanced_classzone_idx);
            order = pgdat->kswapd_max_order;
            classzone_idx = pgdat->classzone_idx;
            new_order = order;
            new_classzone_idx = classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        ret = try_to_freeze();
        if (kthread_should_stop())
            break;

        /*
         * We can speed up thawing tasks if we don't call balance_pgdat
         * after returning from the refrigerator
         */
        if (!ret) {
            trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
            balanced_classzone_idx = classzone_idx;
            balanced_order = balance_pgdat(pgdat, order, /* balance_pgdat() is the main page-reclaim routine */
                        &balanced_classzone_idx);
        }
    }

    tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
    current->reclaim_state = NULL;
    lockdep_clear_current_reclaim_state();

    return 0;
}
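
Stripped of the order/classzone bookkeeping, the loop above amounts to: sleep until reclaim is needed, stop if the thread is being torn down, otherwise call balance_pgdat() for this node and repeat. A self-contained, illustrative distillation (the helpers below are stand-ins, not kernel API):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real calls named in the comments; not kernel API. */
static bool stop_requested;

static void sleep_until_reclaim_needed(void)   /* kswapd_try_to_sleep() */
{
    /* In the kernel this blocks until the node looks unbalanced. */
}

static bool thread_should_stop(void)           /* kthread_should_stop() */
{
    return stop_requested;
}

static void reclaim_node_until_balanced(void)  /* balance_pgdat() */
{
    printf("reclaiming pages for this node\n");
    stop_requested = true;   /* end the demo after one pass */
}

int main(void)
{
    for (;;) {
        sleep_until_reclaim_needed();
        if (thread_should_stop())
            break;
        reclaim_node_until_balanced();
    }
    return 0;
}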

The Implementation of balance_pgdat

static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
                            int *classzone_idx)
{
    bool pgdat_is_balanced = false;
    int i;
    int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
    struct reclaim_state *reclaim_state = current->reclaim_state;
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    struct scan_control sc = {
        .gfp_mask = GFP_KERNEL,
        .may_unmap = 1,
        .may_swap = 1,
        /*
         * kswapd doesn't want to be bailed out while reclaim. because
         * we want to put equal scanning pressure on each zone.
         */
        .nr_to_reclaim = ULONG_MAX,
        .order = order,
        .target_mem_cgroup = NULL,
    };
    struct shrink_control shrink = {
        .gfp_mask = sc.gfp_mask,
    };
loop_again:
    sc.priority = DEF_PRIORITY;
    sc.nr_reclaimed = 0;
    sc.may_writepage = !laptop_mode;
    count_vm_event(PAGEOUTRUN);

    do {
        unsigned long lru_pages = 0;

        /*
         * Scan in the highmem->dma direction for the highest
         * zone which needs scanning
         */
        /* Walk from the highest zone towards the lowest to find the first
         * (highest) zone that is out of balance: end_zone. */
        for (i = pgdat->nr_zones - 1; i >= 0; i--) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (zone->all_unreclaimable &&
                sc.priority != DEF_PRIORITY)
                continue;

            /*
             * Do some background aging of the anon list, to give
             * pages a chance to be referenced before reclaiming.
             */
            age_active_anon(zone, &sc);

            /*
             * If the number of buffer_heads in the machine
             * exceeds the maximum allowed level and this node
             * has a highmem zone, force kswapd to reclaim from
             * it to relieve lowmem pressure.
             */
            if (buffer_heads_over_limit && is_highmem_idx(i)) {
                end_zone = i;
                break;
            }

            if (!zone_balanced(zone, order, 0, 0)) {
                end_zone = i;
                break;
            } else {
                /* If balanced, clear the congested flag */
                zone_clear_flag(zone, ZONE_CONGESTED);
            }
        }

        if (i < 0) {
            pgdat_is_balanced = true;
            goto out;
        }

        /* Page reclaim then works from the lowest zone up to end_zone. */
        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;

            lru_pages += zone_reclaimable_pages(zone);
        }

        /*
         * Now scan the zone in the dma->highmem direction, stopping
         * at the last zone which needs scanning.
         *
         * We do this because the page allocator works in the opposite
         * direction.  This prevents the page allocator from allocating
         * pages behind kswapd's direction of progress, which would
         * cause too much scanning of the lower zones.
         */
        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;
            int nr_slab, testorder;
            unsigned long balance_gap;

            if (!populated_zone(zone))
                continue;

            if (zone->all_unreclaimable &&
                sc.priority != DEF_PRIORITY)
                continue;

            sc.nr_scanned = 0;

            nr_soft_scanned = 0;
            /*
             * Call soft limit reclaim before calling shrink_zone.
             */
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                            order, sc.gfp_mask,
                            &nr_soft_scanned);
            sc.nr_reclaimed += nr_soft_reclaimed;

            /*
             * We put equal pressure on every zone, unless
             * one zone has way too many pages free
             * already. The "too many pages" is defined
             * as the high wmark plus a "gap" where the
             * gap is either the low watermark or 1%
             * of the zone, whichever is smaller.
             */
            balance_gap = min(low_wmark_pages(zone),
                (zone->managed_pages +
                    KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
                KSWAPD_ZONE_BALANCE_GAP_RATIO);
            /*
             * Kswapd reclaims only single pages with compaction
             * enabled. Trying too hard to reclaim until contiguous
             * free pages have become available can hurt performance
             * by evicting too much useful data from memory.
             * Do not reclaim more than needed for compaction.
             */
            testorder = order;
            if (IS_ENABLED(CONFIG_COMPACTION) && order &&
                    compaction_suitable(zone, order) !=
                        COMPACT_SKIPPED)
                testorder = 0;

            if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
                !zone_balanced(zone, testorder,
                       balance_gap, end_zone)) {
                shrink_zone(zone, &sc);

                reclaim_state->reclaimed_slab = 0;
                nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
                sc.nr_reclaimed += reclaim_state->reclaimed_slab;

                if (nr_slab == 0 && !zone_reclaimable(zone))
                    zone->all_unreclaimable = 1;
            }

            /*
             * If we're getting trouble reclaiming, start doing
             * writepage even in laptop mode.
             */
            if (sc.priority < DEF_PRIORITY - 2)
                sc.may_writepage = 1;

            if (zone->all_unreclaimable) {
                if (end_zone && end_zone == i)
                    end_zone--;
                continue;
            }

            if (zone_balanced(zone, testorder, 0, end_zone))
                /*
                 * If a zone reaches its high watermark,
                 * consider it to be no longer congested. It's
                 * possible there are dirty pages backed by
                 * congested BDIs but as pressure is relieved,
                 * speculatively avoid congestion waits
                 */
                zone_clear_flag(zone, ZONE_CONGESTED);
        }

        /*
         * If the low watermark is met there is no need for processes
         * to be throttled on pfmemalloc_wait as they should not be
         * able to safely make forward progress. Wake them
         */
        if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                pfmemalloc_watermark_ok(pgdat))
            wake_up(&pgdat->pfmemalloc_wait);

        if (pgdat_balanced(pgdat, order, *classzone_idx)) {
            pgdat_is_balanced = true;
            break;      /* kswapd: all done */
        }

        /*
         * We do this so kswapd doesn't build up large priorities for
         * example when it is freeing in parallel with allocators. It
         * matches the direct reclaim path behaviour in terms of impact
         * on zone->*_priority.
         */
        if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
            break;
    } while (--sc.priority >= 0);

out:
    if (!pgdat_is_balanced) {
        cond_resched();

        try_to_freeze();

        /*
         * Fragmentation may mean that the system cannot be
         * rebalanced for high-order allocations in all zones.
         * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX,
         * it means the zones have been fully scanned and are still
         * not balanced. For high-order allocations, there is
         * little point trying all over again as kswapd may
         * infinite loop.
         *
         * Instead, recheck all watermarks at order-0 as they
         * are the most important. If watermarks are ok, kswapd will go
         * back to sleep. High-order users can still perform direct
         * reclaim if they wish.
         */
        if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
            order = sc.order = 0;

        goto loop_again;
    }

    /*
     * If kswapd was reclaiming at a higher order, it has the option of
     * sleeping without all zones being balanced. Before it does, it must
     * ensure that the watermarks for order-0 on *all* zones are met and
     * that the congestion flags are cleared. The congestion flag must
     * be cleared as kswapd is the only mechanism that clears the flag
     * and it is potentially going to sleep here.
     */
    if (order) {
        int zones_need_compaction = 1;

        for (i = 0; i <= end_zone; i++) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            /* Check if the memory needs to be defragmented. */
            if (zone_watermark_ok(zone, order,
                    low_wmark_pages(zone), *classzone_idx, 0))
                zones_need_compaction = 0;
        }

        if (zones_need_compaction)
            compact_pgdat(pgdat, order);
    }

    /*
     * Return the order we were reclaiming at so prepare_kswapd_sleep()
     * makes a decision on the order we were last reclaiming at. However,
     * if another caller entered the allocator slow path while kswapd
     * was awake, order will remain at the higher level
     */
    *classzone_idx = end_zone;
    return order;
}
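
Distilled, balance_pgdat() repeats the following until the node is balanced or the scan priority is exhausted: scan the zones from the highest index down to find the highest unbalanced zone (end_zone), then walk the zones from the lowest index up to end_zone and shrink each one that still misses its watermark, raising the reclaim pressure (lowering sc.priority) on each pass. A self-contained, illustrative sketch of that shape (the zone data and helpers are made up for the demo, not kernel API):

#include <stdbool.h>
#include <stdio.h>

#define NR_ZONES     3     /* e.g. DMA, NORMAL, HIGHMEM */
#define DEF_PRIORITY 12    /* the kernel's initial scan priority */

/* Illustrative per-zone state. */
static unsigned long free_pages[NR_ZONES] = { 50, 30, 200 };
static unsigned long high_wmark[NR_ZONES] = { 40, 60, 100 };

static bool zone_is_balanced(int i)
{
    return free_pages[i] >= high_wmark[i];
}

/* Stand-in for shrink_zone(): pretend some pages were reclaimed. */
static void shrink_zone_demo(int i, int priority)
{
    free_pages[i] += (DEF_PRIORITY - priority + 1) * 10;
    printf("shrank zone %d at priority %d, free now %lu\n",
           i, priority, free_pages[i]);
}

int main(void)
{
    int priority = DEF_PRIORITY;

    do {
        /* 1. Highest zone that still needs work (scanned high -> low). */
        int end_zone = -1;
        for (int i = NR_ZONES - 1; i >= 0; i--) {
            if (!zone_is_balanced(i)) {
                end_zone = i;
                break;
            }
        }
        if (end_zone < 0) {
            printf("node is balanced\n");
            break;
        }

        /* 2. Reclaim from the lowest zone up to end_zone (low -> high),
         *    mirroring the direction the page allocator prefers. */
        for (int i = 0; i <= end_zone; i++) {
            if (!zone_is_balanced(i))
                shrink_zone_demo(i, priority);
        }
        /* 3. Increase pressure on the next pass. */
    } while (--priority >= 0);

    return 0;
}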