Contiguous Memory Allocator (CMA) Source Code Analysis



Original article, reposted from Li Haifeng's Blog; please credit the source when reposting.

Contents

1. Overview
2. Initialization
3. Allocation
4. Release
5. Summary

 

1.    Overview

CMA stands for Contiguous Memory Allocator. It works like this: a stretch of memory is reserved for drivers, but while no driver is using it, the page allocator (the buddy system) can hand those pages to user processes as anonymous memory or page cache. When a driver does need the memory, the pages that were handed out are reclaimed or migrated elsewhere, vacating the reserved region for the driver. This article walks through the source code of CMA initialization, allocation, and release (kernel v3.10).

 

2.    Initialization

CMA must be initialized after the memblock early allocator has been set up and before buddy physical-memory management takes over (see the comment above dma_contiguous_reserve: "This function reserves memory from early allocator. It should be called by arch specific code once the early allocator (memblock or bootmem) has been activated and all other subsystems have already allocated/reserved memory.").

On ARM, the entry point is dma_contiguous_reserve(phys_addr_t limit); the limit parameter is the upper physical-address bound of the CMA area.

setup_arch->arm_memblock_init->dma_contiguous_reserve:

107 void __init dma_contiguous_reserve(phys_addr_t limit)
108 {
109         phys_addr_t selected_size = 0;
110
111         pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
112
113         if (size_cmdline != -1) {
114                 selected_size = size_cmdline;
115         } else {
116 #ifdef CONFIG_CMA_SIZE_SEL_MBYTES
117                 selected_size = size_bytes;
118 #elif defined(CONFIG_CMA_SIZE_SEL_PERCENTAGE)
119                 selected_size = cma_early_percent_memory();
120 #elif defined(CONFIG_CMA_SIZE_SEL_MIN)
121                 selected_size = min(size_bytes, cma_early_percent_memory());
122 #elif defined(CONFIG_CMA_SIZE_SEL_MAX)
123                 selected_size = max(size_bytes, cma_early_percent_memory());
124 #endif
125         }
126
127         if (selected_size) {
128                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
129                          (unsigned long)selected_size / SZ_1M);
130
131                 dma_declare_contiguous(NULL, selected_size, 0, limit);
132         }
133 };

Two values matter in this function: selected_size and limit. selected_size is the size of the CMA area being declared, and limit is the upper bound on where the area may be placed.

selected_size is chosen as follows. If cma=<size> was passed on the kernel command line (for example cma=64M), that value wins (line 114). Otherwise the choice comes from Kconfig: CONFIG_CMA_SIZE_SEL_MBYTES uses the fixed size (line 117) and CONFIG_CMA_SIZE_SEL_PERCENTAGE uses a percentage of total memory (line 119). If neither is selected, the minimum (line 121) or maximum (line 123) of CONFIG_CMA_SIZE_MBYTES and CONFIG_CMA_SIZE_PERCENTAGE is used.
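As a worked example of the selection logic, here is a small userspace sketch with hypothetical config values (this is an illustration, not kernel code):

#include <stdio.h>

#define SZ_1M (1024UL * 1024)

int main(void)
{
        /* hypothetical setup: CONFIG_CMA_SIZE_MBYTES=16 and
         * CONFIG_CMA_SIZE_PERCENTAGE=10 on a machine with 1 GiB of RAM */
        unsigned long size_bytes = 16 * SZ_1M;               /* fixed size */
        unsigned long pct_bytes  = 1024 * SZ_1M / 100 * 10;  /* ~102 MiB   */

        unsigned long sel_min = size_bytes < pct_bytes ? size_bytes : pct_bytes;
        unsigned long sel_max = size_bytes > pct_bytes ? size_bytes : pct_bytes;

        /* CMA_SIZE_SEL_MIN picks 16 MiB, CMA_SIZE_SEL_MAX picks ~102 MiB */
        printf("min: %lu MiB, max: %lu MiB\n",
               sel_min / SZ_1M, sel_max / SZ_1M);
        return 0;
}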

With the CMA size computed and the limit in hand, the code enters dma_declare_contiguous.

setup_arch->arm_memblock_init->dma_contiguous_reserve->dma_declare_contiguous:

218 /**
219  * dma_declare_contiguous() - reserve area for contiguous memory handling
220  *                            for particular device
221  * @dev:   Pointer to device structure.
222  * @size:  Size of the reserved memory.
223  * @base:  Start address of the reserved memory (optional, 0 for any).
224  * @limit: End address of the reserved memory (optional, 0 for any).
225  *
226  * This function reserves memory for specified device. It should be
227  * called by board specific code when early allocator (memblock or bootmem)
228  * is still activate.
229  */
230 int __init dma_declare_contiguous(struct device *dev, phys_addr_t size,
231                                   phys_addr_t base, phys_addr_t limit)
232 {
233         struct cma_reserved *r = &cma_reserved[cma_reserved_count];
234         phys_addr_t alignment;
235
248
249         /* Sanitise input arguments */
250         alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
251         base = ALIGN(base, alignment);
252         size = ALIGN(size, alignment);
253         limit &= ~(alignment - 1);
254
255         /* Reserve memory */
256         if (base) {
257                 if (memblock_is_region_reserved(base, size) ||
258                     memblock_reserve(base, size) < 0) {
259                         base = -EBUSY;
260                         goto err;
261                 }
262         } else {
263                 /*
264                  * Use __memblock_alloc_base() since
265                  * memblock_alloc_base() panic()s.
266                  */
267                 phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
268                 if (!addr) {
269                         base = -ENOMEM;
270                         goto err;
271                 } else {
272                         base = addr;
273                 }
274         }
275
276         /*
277          * Each reserved area must be initialised later, when more kernel
278          * subsystems (like slab allocator) are available.
279          */
280         r->start = base;
281         r->size = size;
282         r->dev = dev;
283         cma_reserved_count++;
284         pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
285                 (unsigned long)base);
286
287         /* Architecture specific contiguous memory fixup. */
288         dma_contiguous_early_fixup(base, size);
289         return 0;
290 err:
291         pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
292         return base;
293 }

This function derives the CMA area's base address and size from the size and limit arguments. If no base was specified (it is 0 in this path), one is allocated with the early allocator; memblock allocates top-down, so the area comes from the high end of lowmem. Both base and size are rounded to an alignment that is normally 4 MiB (line 250: with MAX_ORDER = 11 and pageblock_order = 10, the alignment is PAGE_SIZE << 10 = 4 MiB for 4 KiB pages).
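To make the alignment concrete, here is a small userspace sketch of the line-250 computation (assuming 4 KiB pages, MAX_ORDER = 11, pageblock_order = 10; the cma=33M value is just an example):

#include <stdio.h>

#define PAGE_SIZE       4096UL
#define MAX_ORDER       11
#define PAGEBLOCK_ORDER 10
#define MAX(a, b)       ((a) > (b) ? (a) : (b))

int main(void)
{
        /* alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order) */
        unsigned long alignment = PAGE_SIZE << MAX(MAX_ORDER - 1, PAGEBLOCK_ORDER);
        unsigned long size = 33UL * 1024 * 1024;  /* e.g. cma=33M on the cmdline */

        /* ALIGN() rounds up to the next alignment boundary */
        unsigned long aligned = (size + alignment - 1) & ~(alignment - 1);

        printf("alignment = %lu MiB, 33 MiB rounds up to %lu MiB\n",
               alignment >> 20, aligned >> 20);   /* 4 MiB, 36 MiB */
        return 0;
}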

The resulting base and size are stored in the global cma_reserved[] array (lines 280-282), and cma_reserved_count records how many CMA areas have been reserved so far (line 283).
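For reference, the bookkeeping array being filled here looks like this in v3.10's drivers/base/dma-contiguous.c (reproduced from memory, so treat it as a sketch):

static struct cma_reserved {
        phys_addr_t     start;   /* physical base of the reserved area */
        unsigned long   size;    /* size in bytes */
        struct device   *dev;    /* owning device, NULL for the global area */
} cma_reserved[MAX_CMA_AREAS] __initdata;

static unsigned cma_reserved_count __initdata;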

In the ARM kernel code, each CMA area is additionally recorded in the dma_mmu_remap[] array, whose entries hold only the base and size (lines 396-410 below).

396 struct dma_contig_early_reserve {
397         phys_addr_t base;
398         unsigned long size;
399 };
400
401 static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
402
403 static int dma_mmu_remap_num __initdata;
404
405 void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
406 {
407         dma_mmu_remap[dma_mmu_remap_num].base = base;
408         dma_mmu_remap[dma_mmu_remap_num].size = size;
409         dma_mmu_remap_num++;
410 }

Up to this point the CMA areas have merely been reserved and recorded in the arrays above. Once the buddy system has finished initializing, the reserved areas receive further processing:

200 static int __init cma_init_reserved_areas(void)
201 {
202         struct cma_reserved *r = cma_reserved;
203         unsigned i = cma_reserved_count;
204
205         pr_debug("%s()\n", __func__);
206
207         for (; i; --i, ++r) {
208                 struct cma *cma;
209                 cma = cma_create_area(PFN_DOWN(r->start),
210                                       r->size >> PAGE_SHIFT);
211                 if (!IS_ERR(cma))
212                         dev_set_cma_area(r->dev, cma);
213         }
214         return 0;
215 }
216 core_initcall(cma_init_reserved_areas);

At line 212 the newly created struct cma is attached to the device (it ends up in dev->cma_area). When a device later performs a CMA allocation, the buffer is carved out of the area recorded there.
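For context, the v3.10 definitions involved here are roughly the following (quoted from memory from drivers/base/dma-contiguous.c and include/linux/dma-contiguous.h; details may differ slightly). Note the fallback to the global default area when a device has no private one:

struct cma {
        unsigned long   base_pfn;       /* first PFN of the area */
        unsigned long   count;          /* number of pages */
        unsigned long   *bitmap;        /* one bit per page: set = in use */
};

extern struct cma *dma_contiguous_default_area;

static inline struct cma *dev_get_cma_area(struct device *dev)
{
        if (dev && dev->cma_area)
                return dev->cma_area;
        return dma_contiguous_default_area;     /* global CMA area */
}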

cma_init_reserved_areas->cma_create_area:

159 static __init struct cma *cma_create_area(unsigned long base_pfn,
160                                           unsigned long count)
161 {
162         int bitmap_size = BITS_TO_LONGS(count) * sizeof(long);
163         struct cma *cma;
164         int ret = -ENOMEM;
165
166         pr_debug("%s(base %08lx, count %lx)\n", __func__, base_pfn, count);
167
168         cma = kmalloc(sizeof *cma, GFP_KERNEL);
169         if (!cma)
170                 return ERR_PTR(-ENOMEM);
171
172         cma->base_pfn = base_pfn;
173         cma->count = count;
174         cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
175
176         if (!cma->bitmap)
177                 goto no_mem;
178
179         ret = cma_activate_area(base_pfn, count);
180         if (ret)
181                 goto error;
182
183         pr_debug("%s: returned %p\n", __func__, (void *)cma);
184         return cma;
185
186 error:
187         kfree(cma->bitmap);
188 no_mem:
189         kfree(cma);
190         return ERR_PTR(ret);
191 }

cma_init_reserved_areas->cma_create_area->cma_activate_area:

137 static __init int cma_activate_area(unsigned long base_pfn, unsigned long count)
138 {
139         unsigned long pfn = base_pfn;
140         unsigned i = count >> pageblock_order;
141         struct zone *zone;
142
143         WARN_ON_ONCE(!pfn_valid(pfn));
144         zone = page_zone(pfn_to_page(pfn));
145
146         do {
147                 unsigned j;
148                 base_pfn = pfn;
149                 for (j = pageblock_nr_pages; j; --j, pfn++) {
150                         WARN_ON_ONCE(!pfn_valid(pfn));
151                         if (page_zone(pfn_to_page(pfn)) != zone)
152                                 return -EINVAL;
153                 }
154                 init_cma_reserved_pageblock(pfn_to_page(base_pfn));
155         } while (--i);
156         return 0;
157 }

Recall that during reservation both base and size were aligned to pageblock_order. pageblock_order is 10 here, so one pageblock covers 4 MiB (2^10 pages x 4 KiB PAGE_SIZE). cma_activate_area therefore walks the area one pageblock at a time (lines 140 and 155). CMA requires all pages of an area to live in a single zone, so lines 149-153 check that every page belongs to the same zone before each pageblock is initialized.
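A quick userspace sanity check of the sizes involved (assuming 4 KiB pages, pageblock_order = 10 and 64-bit longs; the 16 MiB area is an arbitrary example):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096, pageblock_order = 10;
        unsigned long count = (16UL << 20) / page_size;      /* pages in a 16 MiB area */
        unsigned long blocks = count >> pageblock_order;     /* loop count in cma_activate_area */
        unsigned long bitmap_bytes = (count + 63) / 64 * 8;  /* BITS_TO_LONGS(count) * sizeof(long) */

        printf("%lu pages, %lu pageblocks, %lu-byte bitmap\n",
               count, blocks, bitmap_bytes);  /* 4096 pages, 4 blocks, 512 bytes */
        return 0;
}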

cma_init_reserved_areas->cma_create_area->cma_activate_area->init_cma_reserved_pageblock:

769 #ifdef CONFIG_CMA
770 /* Free whole pageblock and set it's migration type to MIGRATE_CMA. */
771 void __init init_cma_reserved_pageblock(struct page *page)
772 {
773         unsigned i = pageblock_nr_pages;
774         struct page *p = page;
775
776         do {
777                 __ClearPageReserved(p);
778                 set_page_count(p, 0);
779         } while (++p, --i);
780
781         set_page_refcounted(page);
782         set_pageblock_migratetype(page, MIGRATE_CMA);
783         __free_pages(page, pageblock_order);
784         totalram_pages += pageblock_nr_pages;
785 #ifdef CONFIG_HIGHMEM
786         if (PageHighMem(page))
787                 totalhigh_pages += pageblock_nr_pages;
788 #endif
789 }

A free page entering the buddy system must have page->_count == 0, so line 778 zeroes the reference count of every page in the pageblock. Line 781 then sets _count of the block's first page to 1, because __free_pages at line 783 will drop it again via put_page_testzero. The pageblock's migratetype is also set to MIGRATE_CMA (line 782). Migratetypes are stored in zone->pageblock_flags at 3 bits each, and they are tracked at pageblock granularity, keyed off the block's first page; setting a migratetype on the other pages would have no effect.

With the pages initialized, __free_pages hands them to the buddy system (line 783).

So right after initialization all CMA memory sits on the order-10 buddy lists, specifically on zone->free_area[10].free_list[MIGRATE_CMA] of the owning zone. (The CMA pageblocks can be observed in /proc/pagetypeinfo under the CMA migrate type.)

3.    Allocation

CMA is not exposed to driver writers directly. A driver just calls the usual DMA API, e.g. dma_alloc_coherent, when it needs a DMA buffer, and the DMA layer eventually lands in CMA's allocation function, dma_alloc_from_contiguous.
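A minimal sketch of the driver side (example_probe is a hypothetical probe callback; on ARM v3.10 with CMA enabled, dma_alloc_coherent ends up in dma_alloc_from_contiguous):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

static int example_probe(struct device *dev)
{
        dma_addr_t dma_handle;
        void *vaddr;

        /* ask for 1 MiB of DMA memory; with CMA this is carved out of
         * the device's (or the global) contiguous area */
        vaddr = dma_alloc_coherent(dev, SZ_1M, &dma_handle, GFP_KERNEL);
        if (!vaddr)
                return -ENOMEM;

        /* ... program the device with dma_handle, run the transfer ... */

        dma_free_coherent(dev, SZ_1M, vaddr, dma_handle);
        return 0;
}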

295 /**
296  * dma_alloc_from_contiguous() - allocate pages from contiguous area
297  * @dev:   Pointer to device for which the allocation is performed.
298  * @count: Requested number of pages.
299  * @align: Requested alignment of pages (in PAGE_SIZE order).
300  *
301  * This function allocates memory buffer for specified device. It uses
302  * device specific contiguous memory area if available or the default
303  * global one. Requires architecture specific get_dev_cma_area() helper
304  * function.
305  */
306 struct page *dma_alloc_from_contiguous(struct device *dev, int count,
307                                        unsigned int align)
308 {
309         unsigned long mask, pfn, pageno, start = 0;
310         struct cma *cma = dev_get_cma_area(dev);
311         struct page *page = NULL;
312         int ret;
313
314         if (!cma || !cma->count)
315                 return NULL;
316
317         if (align > CONFIG_CMA_ALIGNMENT)
318                 align = CONFIG_CMA_ALIGNMENT;
319
320         pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma,
321                  count, align);
322
323         if (!count)
324                 return NULL;
325
326         mask = (1 << align) - 1;
327
328         mutex_lock(&cma_mutex);
329
330         for (;;) {
331                 pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
332                                                     start, count, mask);
333                 if (pageno >= cma->count)
334                         break;
335
336                 pfn = cma->base_pfn + pageno;
337                 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
338                 if (ret == 0) {
339                         bitmap_set(cma->bitmap, pageno, count);
340                         page = pfn_to_page(pfn);
341                         break;
342                 } else if (ret != -EBUSY) {
343                         break;
344                 }
345                 pr_debug("%s(): memory range at %p is busy, retrying\n",
346                          __func__, pfn_to_page(pfn));
347                 /* try again with a bit different memory target */
348                 start = pageno + mask + 1;
349         }
350
351         mutex_unlock(&cma_mutex);
352         pr_debug("%s(): returned %p\n", __func__, page);
353         return page;
354 }

As the comment at lines 301-304 says, the function allocates a buffer from the device-specific CMA area, falling back to the default global one. Line 310 looks up the device's CMA area; if the device has none, the default area is used. Every CMA area carries a bitmap (cma->bitmap) recording which of its pages are in use, so finding N contiguous free pages amounts to finding N consecutive zero bits, which stand for N contiguous physical pages. On success the search returns an index within the area's bounds (checked at line 333) and the corresponding bits are set (line 339).
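The search semantics can be illustrated with a toy userspace re-implementation (an illustration of the idea only, not the kernel's actual bitmap_find_next_zero_area, and using one byte per bit for simplicity):

#include <stdio.h>

/* find `nr` consecutive zero entries in map[0..size), starting at `start`,
 * with the first index aligned so that (index & mask) == 0 -- the same
 * contract the CMA allocator relies on */
static unsigned long find_zero_area(const unsigned char *map, unsigned long size,
                                    unsigned long start, unsigned long nr,
                                    unsigned long mask)
{
        unsigned long index, i;

        for (index = (start + mask) & ~mask; index + nr <= size;
             index = (index + 1 + mask) & ~mask) {
                for (i = 0; i < nr; i++)
                        if (map[index + i])
                                break;
                if (i == nr)
                        return index;   /* caller then sets bits [index, index+nr) */
        }
        return size;                    /* failure: mimics pageno >= cma->count */
}

int main(void)
{
        unsigned char map[16] = { 1, 1, 0, 0, 1, 0, 0, 0 }; /* rest are 0 */

        /* 3 free pages, aligned to 2 pages: skips index 2 (map[4] busy),
         * finds index 6 */
        printf("pageno = %lu\n", find_zero_area(map, 16, 0, 3, 1));
        return 0;
}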

dma_alloc_from_contiguous->alloc_contig_range:

5908 /**
5909  * alloc_contig_range() -- tries to allocate given range of pages
5910  * @start:      start PFN to allocate
5911  * @end:        one-past-the-last PFN to allocate
5912  * @migratetype:        migratetype of the underlaying pageblocks (either
5913  *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
5914  *                      in range must have the same migratetype and it must
5915  *                      be either of the two.
5916  *
5917  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
5918  * aligned, however it's the caller's responsibility to guarantee that
5919  * we are the only thread that changes migrate type of pageblocks the
5920  * pages fall in.
5921  *
5922  * The PFN range must belong to a single zone.
5923  *
5924  * Returns zero on success or negative error code.  On success all
5925  * pages which PFN is in [start, end) are allocated for the caller and
5926  * need to be freed with free_contig_range().
5927  */
5928 int alloc_contig_range(unsigned long start, unsigned long end,
5929                        unsigned migratetype)
5930 {
5931         unsigned long outer_start, outer_end;
5932         int ret = 0, order;
5933
5934         struct compact_control cc = {
5935                 .nr_migratepages = 0,
5936                 .order = -1,
5937                 .zone = page_zone(pfn_to_page(start)),
5938                 .sync = true,
5939                 .ignore_skip_hint = true,
5940         };
5941         INIT_LIST_HEAD(&cc.migratepages);
5942
5943         /*
5944          * What we do here is we mark all pageblocks in range as
5945          * MIGRATE_ISOLATE.  Because pageblock and max order pages may
5946          * have different sizes, and due to the way page allocator
5947          * work, we align the range to biggest of the two pages so
5948          * that page allocator won't try to merge buddies from
5949          * different pageblocks and change MIGRATE_ISOLATE to some
5950          * other migration type.
5951          *
5952          * Once the pageblocks are marked as MIGRATE_ISOLATE, we
5953          * migrate the pages from an unaligned range (ie. pages that
5954          * we are interested in).  This will put all the pages in
5955          * range back to page allocator as MIGRATE_ISOLATE.
5956          *
5957          * When this is done, we take the pages in range from page
5958          * allocator removing them from the buddy system.  This way
5959          * page allocator will never consider using them.
5960          *
5961          * This lets us mark the pageblocks back as
5962          * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
5963          * aligned range but not in the unaligned, original range are
5964          * put back to page allocator so that buddy can use them.
5965          */
5966
5967         ret = start_isolate_page_range(pfn_max_align_down(start),
5968                                        pfn_max_align_up(end), migratetype,
5969                                        false);
5970         if (ret)
5971                 return ret;
5972
5973         ret = __alloc_contig_migrate_range(&cc, start, end);
5974         if (ret)
5975                 goto done;
5976
5977         /*
5978          * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
5979          * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
5980          * more, all pages in [start, end) are free in page allocator.
5981          * What we are going to do is to allocate all pages from
5982          * [start, end) (that is remove them from page allocator).
5983          *
5984          * The only problem is that pages at the beginning and at the
5985          * end of interesting range may be not aligned with pages that
5986          * page allocator holds, ie. they can be part of higher order
5987          * pages.  Because of this, we reserve the bigger range and
5988          * once this is done free the pages we are not interested in.
5989          *
5990          * We don't have to hold zone->lock here because the pages are
5991          * isolated thus they won't get removed from buddy.
5992          */
5993
5994         lru_add_drain_all();
5995         drain_all_pages();
5996
5997         order = 0;
5998         outer_start = start;
5999         while (!PageBuddy(pfn_to_page(outer_start))) {
6000                 if (++order >= MAX_ORDER) {
6001                         ret = -EBUSY;
6002                         goto done;
6003                 }
6004                 outer_start &= ~0UL << order;
6005         }
6006
6007         /* Make sure the range is really isolated. */
6008         if (test_pages_isolated(outer_start, end, false)) {
6009                 pr_warn("alloc_contig_range test_pages_isolated(%lx, %lx) failed\n",
6010                         outer_start, end);
6011                 ret = -EBUSY;
6012                 goto done;
6013         }
6014
6015
6016         /* Grab isolated pages from freelists. */
6017         outer_end = isolate_freepages_range(&cc, outer_start, end);
6018         if (!outer_end) {
6019                 ret = -EBUSY;
6020                 goto done;
6021         }
6022
6023         /* Free head and tail (if any) */
6024         if (start != outer_start)
6025                 free_contig_range(outer_start, start - outer_start);
6026         if (end != outer_end)
6027                 free_contig_range(end, outer_end - end);
6028
6029 done:
6030         undo_isolate_page_range(pfn_max_align_down(start),
6031                                 pfn_max_align_up(end), migratetype);
6032         return ret;
6033 }

The comment block spells out the caller's obligations: the alignment rules, the requirement that the whole range lie in a single zone, and that the pages must later be freed with free_contig_range.

At line 5967 the start and end are first widened to the alignment boundary, and start_isolate_page_range verifies that the range contains no unmovable pages: an unmovable page can be neither reclaimed nor migrated, so it would pin its memory and make the whole range unusable for CMA (start_isolate_page_range->set_migratetype_isolate->has_unmovable_pages). If the check passes, the pageblocks are flagged MIGRATE_ISOLATE, their free pages are moved onto free_area[].free_list[MIGRATE_ISOLATE], and drain_all_pages is called (from set_migratetype_isolate) to flush the per-CPU free pages back to the buddy system, since pages of the target range may still sit in a pcp pageset (the per-CPU cache of free hot pages). Next, __alloc_contig_migrate_range at line 5973 picks out the pages in the range that the buddy system had already handed out and migrates them elsewhere, vacating the physical pages for CMA. Once the range is contiguous and free, isolate_freepages_range at line 6017 detaches the free pages from the buddy system.
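The alignment helpers used at lines 5967 and 6030 widen the range to the larger of MAX_ORDER_NR_PAGES and pageblock_nr_pages; in v3.10's mm/page_alloc.c they look roughly like this (reproduced from memory, treat as a sketch):

static unsigned long pfn_max_align_down(unsigned long pfn)
{
        /* round down to the larger of the two block sizes */
        return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES,
                             pageblock_nr_pages) - 1);
}

static unsigned long pfn_max_align_up(unsigned long pfn)
{
        /* round up to the larger of the two block sizes */
        return ALIGN(pfn, max_t(unsigned long, MAX_ORDER_NR_PAGES,
                                pageblock_nr_pages));
}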

The code of __alloc_contig_migrate_range:

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range:

5862 /* [start, end) must belong to a single zone. */
5863 static int __alloc_contig_migrate_range(struct compact_control *cc,
5864                                         unsigned long start, unsigned long end)
5865 {
5866         /* This function is based on compact_zone() from compaction.c. */
5867         unsigned long nr_reclaimed;
5868         unsigned long pfn = start;
5869         unsigned int tries = 0;
5870         int ret = 0;
5871
5872         migrate_prep();
5873
5874         while (pfn < end || !list_empty(&cc->migratepages)) {
5875                 if (fatal_signal_pending(current)) {
5876                         ret = -EINTR;
5877                         break;
5878                 }
5879
5880                 if (list_empty(&cc->migratepages)) {
5881                         cc->nr_migratepages = 0;
5882                         pfn = isolate_migratepages_range(cc->zone, cc,
5883                                                          pfn, end, true);
5884                         if (!pfn) {
5885                                 ret = -EINTR;
5886                                 break;
5887                         }
5888                         tries = 0;
5889                 } else if (++tries == 5) {
5890                         ret = ret < 0 ? ret : -EBUSY;
5891                         break;
5892                 }
5893
5894                 nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
5895                                                              &cc->migratepages);
5896                 cc->nr_migratepages -= nr_reclaimed;
5897
5898                 ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
5899                                     0, MIGRATE_SYNC, MR_CMA);
5900         }
5901         if (ret < 0) {
5902                 putback_movable_pages(&cc->migratepages);
5903                 return ret;
5904         }
5905         return 0;
5906 }

This function does the migration work. Because CMA pages are allowed to be handed out by the buddy system as movable allocations, a page may have been given to someone else while cma->bitmap still records it as usable for CMA; such a page must now be migrated elsewhere to vacate it for the CMA allocation.

Line 5882 collects the pages that the buddy system had handed out onto the cc->migratepages list. reclaim_clean_pages_from_list at line 5894 then checks which of them are clean and can be reclaimed outright, and migrate_pages at line 5898 moves the contents of the pages that cannot be reclaimed right now to other physical memory.

The function that isolates the pages to be migrated or reclaimed, isolate_migratepages_range:

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->isolate_migratepages_range:

427 /**
428  * isolate_migratepages_range() - isolate all migrate-able pages in range.
429  * @zone:       Zone pages are in.
430  * @cc:         Compaction control structure.
431  * @low_pfn:    The first PFN of the range.
432  * @end_pfn:    The one-past-the-last PFN of the range.
433  * @unevictable: true if it allows to isolate unevictable pages
434  *
435  * Isolate all pages that can be migrated from the range specified by
436  * [low_pfn, end_pfn).  Returns zero if there is a fatal signal
437  * pending), otherwise PFN of the first page that was not scanned
438  * (which may be both less, equal to or more then end_pfn).
439  *
440  * Assumes that cc->migratepages is empty and cc->nr_migratepages is
441  * zero.
442  *
443  * Apart from cc->migratepages and cc->nr_migratetypes this function
444  * does not modify any cc's fields, in particular it does not modify
445  * (or read for that matter) cc->migrate_pfn.
446  */
447 unsigned long
448 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
449                            unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
450 {
451         unsigned long last_pageblock_nr = 0, pageblock_nr;
452         unsigned long nr_scanned = 0, nr_isolated = 0;
453         struct list_head *migratelist = &cc->migratepages;
454         isolate_mode_t mode = 0;
455         struct lruvec *lruvec;
456         unsigned long flags;
457         bool locked = false;
458         struct page *page = NULL, *valid_page = NULL;
459
460         /*
461          * Ensure that there are not too many pages isolated from the LRU
462          * list by either parallel reclaimers or compaction. If there are,
463          * delay for some time until fewer pages are isolated
464          */
465         while (unlikely(too_many_isolated(zone))) {
466                 /* async migration should just abort */
467                 if (!cc->sync)
468                         return 0;
469
470                 congestion_wait(BLK_RW_ASYNC, HZ/10);
471
472                 if (fatal_signal_pending(current))
473                         return 0;
474         }
475
476         /* Time to isolate some pages for migration */
477         cond_resched();
478         for (; low_pfn < end_pfn; low_pfn++) {
479                 /* give a chance to irqs before checking need_resched() */
480                 if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
481                         if (should_release_lock(&zone->lru_lock)) {
482                                 spin_unlock_irqrestore(&zone->lru_lock, flags);
483                                 locked = false;
484                         }
485                 }
486
487                 /*
488                  * migrate_pfn does not necessarily start aligned to a
489                  * pageblock. Ensure that pfn_valid is called when moving
490                  * into a new MAX_ORDER_NR_PAGES range in case of large
491                  * memory holes within the zone
492                  */
493                 if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
494                         if (!pfn_valid(low_pfn)) {
495                                 low_pfn += MAX_ORDER_NR_PAGES - 1;
496                                 continue;
497                         }
498                 }
499
500                 if (!pfn_valid_within(low_pfn))
501                         continue;
502                 nr_scanned++;
503
504                 /*
505                  * Get the page and ensure the page is within the same zone.
506                  * See the comment in isolate_freepages about overlapping
507                  * nodes. It is deliberate that the new zone lock is not taken
508                  * as memory compaction should not move pages between nodes.
509                  */
510                 page = pfn_to_page(low_pfn);
511                 if (page_zone(page) != zone)
512                         continue;
513
514                 if (!valid_page)
515                         valid_page = page;
516
517                 /* If isolation recently failed, do not retry */
518                 pageblock_nr = low_pfn >> pageblock_order;
519                 if (!isolation_suitable(cc, page))
520                         goto next_pageblock;
521
522                 /* Skip if free */
523                 if (PageBuddy(page))
524                         continue;
525
526                 /*
527                  * For async migration, also only scan in MOVABLE blocks. Async
528                  * migration is optimistic to see if the minimum amount of work
529                  * satisfies the allocation
530                  */
531                 if (!cc->sync && last_pageblock_nr != pageblock_nr &&
532                     !migrate_async_suitable(get_pageblock_migratetype(page))) {
533                         cc->finished_update_migrate = true;
534                         goto next_pageblock;
535                 }
536
537                 /*
538                  * Check may be lockless but that's ok as we recheck later.
539                  * It's possible to migrate LRU pages and balloon pages
540                  * Skip any other type of page
541                  */
542                 if (!PageLRU(page)) {
543                         if (unlikely(balloon_page_movable(page))) {
544                                 if (locked && balloon_page_isolate(page)) {
545                                         /* Successfully isolated */
546                                         cc->finished_update_migrate = true;
547                                         list_add(&page->lru, migratelist);
548                                         cc->nr_migratepages++;
549                                         nr_isolated++;
550                                         goto check_compact_cluster;
551                                 }
552                         }
553                         continue;
554                 }
555
556                 /*
557                  * PageLRU is set. lru_lock normally excludes isolation
558                  * splitting and collapsing (collapsing has already happened
559                  * if PageLRU is set) but the lock is not necessarily taken
560                  * here and it is wasteful to take it just to check transhuge.
561                  * Check TransHuge without lock and skip the whole pageblock if
562                  * it's either a transhuge or hugetlbfs page, as calling
563                  * compound_order() without preventing THP from splitting the
564                  * page underneath us may return surprising results.
565                  */
566                 if (PageTransHuge(page)) {
567                         if (!locked)
568                                 goto next_pageblock;
569                         low_pfn += (1 << compound_order(page)) - 1;
570                         continue;
571                 }
572
573                 /* Check if it is ok to still hold the lock */
574                 locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
575                                                    locked, cc);
576                 if (!locked || fatal_signal_pending(current))
577                         break;
578
579                 /* Recheck PageLRU and PageTransHuge under lock */
580                 if (!PageLRU(page))
581                         continue;
582                 if (PageTransHuge(page)) {
583                         low_pfn += (1 << compound_order(page)) - 1;
584                         continue;
585                 }
586
587                 if (!cc->sync)
588                         mode |= ISOLATE_ASYNC_MIGRATE;
589
590                 if (unevictable)
591                         mode |= ISOLATE_UNEVICTABLE;
592
593                 lruvec = mem_cgroup_page_lruvec(page, zone);
594
595                 /* Try isolate the page */
596                 if (__isolate_lru_page(page, mode) != 0)
597                         continue;
598
599                 VM_BUG_ON(PageTransCompound(page));
600
601                 /* Successfully isolated */
602                 cc->finished_update_migrate = true;
603                 del_page_from_lru_list(page, lruvec, page_lru(page));
604                 list_add(&page->lru, migratelist);
605                 cc->nr_migratepages++;
606                 nr_isolated++;
607
608 check_compact_cluster:
609                 /* Avoid isolating too much */
610                 if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
611                         ++low_pfn;
612                         break;
613                 }
614
615                 continue;
616
617 next_pageblock:
618                 low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
619                 last_pageblock_nr = pageblock_nr;
620         }
621
622         acct_isolated(zone, locked, cc);
623
624         if (locked)
625                 spin_unlock_irqrestore(&zone->lru_lock, flags);
626
627         /* Update the pageblock-skip if the whole pageblock was scanned */
628         if (low_pfn == end_pfn)
629                 update_pageblock_skip(cc, valid_page, nr_isolated, true);
630
631         trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
632
633         count_compact_events(COMPACTMIGRATE_SCANNED, nr_scanned);
634         if (nr_isolated)
635                 count_compact_events(COMPACTISOLATED, nr_isolated);
636
637         return low_pfn;
638 }

This function scans the chosen CMA range and isolates the pages that are in use (PageLRU(page) is true) onto the cc->migratepages list for later handling. The pages to deal with fall into two classes: those that can be reclaimed directly (such as clean page cache) and those that cannot be reclaimed for now, whose contents must be migrated elsewhere. The path for directly reclaimable pages is:

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->reclaim_clean_pages_from_list:

968 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
969                                             struct list_head *page_list)
970 {
971         struct scan_control sc = {
972                 .gfp_mask = GFP_KERNEL,
973                 .priority = DEF_PRIORITY,
974                 .may_unmap = 1,
975         };
976         unsigned long ret, dummy1, dummy2;
977         struct page *page, *next;
978         LIST_HEAD(clean_pages);
979
980         list_for_each_entry_safe(page, next, page_list, lru) {
981                 if (page_is_file_cache(page) && !PageDirty(page)) {
982                         ClearPageActive(page);
983                         list_move(&page->lru, &clean_pages);
984                 }
985         }
986
987         ret = shrink_page_list(&clean_pages, zone, &sc,
988                                TTU_UNMAP|TTU_IGNORE_ACCESS,
989                                &dummy1, &dummy2, true);
990         list_splice(&clean_pages, page_list);
991         __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
992         return ret;
993 }

Pages that are file cache and clean are reclaimed on the spot (lines 981-989); if their contents are needed again later, they are simply re-read from the file. Pages that could not be reclaimed for whatever reason are spliced back onto cc->migratepages to be migrated instead (line 990).

The migration path:

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->migrate_pages:

988 /*
989  * migrate_pages - migrate the pages specified in a list, to the free pages
990  *                 supplied as the target for the page migration
991  *
992  * @from:               The list of pages to be migrated.
993  * @get_new_page:       The function used to allocate free pages to be used
994  *                      as the target of the page migration.
995  * @private:            Private data to be passed on to get_new_page()
996  * @mode:               The migration mode that specifies the constraints for
997  *                      page migration, if any.
998  * @reason:             The reason for page migration.
999  *
1000  * The function returns after 10 attempts or if no pages are movable any more
1001  * because the list has become empty or no retryable pages exist any more.
1002  * The caller should call putback_lru_pages() to return pages to the LRU
1003  * or free list only if ret != 0.
1004  *
1005  * Returns the number of pages that were not migrated, or an error code.
1006  */
1007 int migrate_pages(struct list_head *from, new_page_t get_new_page,
1008                   unsigned long private, enum migrate_mode mode, int reason)
1009 {
1010         int retry = 1;
1011         int nr_failed = 0;
1012         int nr_succeeded = 0;
1013         int pass = 0;
1014         struct page *page;
1015         struct page *page2;
1016         int swapwrite = current->flags & PF_SWAPWRITE;
1017         int rc;
1018
1019         if (!swapwrite)
1020                 current->flags |= PF_SWAPWRITE;
1021
1022         for (pass = 0; pass < 10 && retry; pass++) {
1023                 retry = 0;
1024
1025                 list_for_each_entry_safe(page, page2, from, lru) {
1026                         cond_resched();
1027
1028                         rc = unmap_and_move(get_new_page, private,
1029                                             page, pass > 2, mode);
1030
1031                         switch (rc) {
1032                         case -ENOMEM:
1033                                 goto out;
1034                         case -EAGAIN:
1035                                 retry++;
1036                                 break;
1037                         case MIGRATEPAGE_SUCCESS:
1038                                 nr_succeeded++;
1039                                 break;
1040                         default:
1041                                 /* Permanent failure */
1042                                 nr_failed++;
1043                                 break;
1044                         }
1045                 }
1046         }
1047         rc = nr_failed + retry;
1048 out:
1049         if (nr_succeeded)
1050                 count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
1051         if (nr_failed)
1052                 count_vm_events(PGMIGRATE_FAIL, nr_failed);
1053         trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
1054
1055         if (!swapwrite)
1056                 current->flags &= ~PF_SWAPWRITE;
1057
1058         return rc;
1059 }

The key statement is unmap_and_move at line 1028: allocate a fresh page through get_new_page, move the contents across, and unmap the old page.
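The get_new_page callback on this path is alloc_migrate_target (passed at line 5898 above), which simply asks the buddy allocator for some other movable page to copy into; in v3.10's mm/page_isolation.c it looks roughly like this (reproduced from memory, treat as a sketch):

struct page *alloc_migrate_target(struct page *page, unsigned long private,
                                  int **resultp)
{
        /* any movable page anywhere else in memory will do */
        gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;

        if (PageHighMem(page))
                gfp_mask |= __GFP_HIGHMEM;

        return alloc_page(gfp_mask);
}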

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range->migrate_pages->unmap_and_move:

858 /*
859  * Obtain the lock on page, remove all ptes and migrate the page
860  * to the newly allocated page in newpage.
861  */
862 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
863                           struct page *page, int force, enum migrate_mode mode)
864 {
865         int rc = 0;
866         int *result = NULL;
867         struct page *newpage = get_new_page(page, private, &result);
868
869         if (!newpage)
870                 return -ENOMEM;
871
872         if (page_count(page) == 1) {
873                 /* page was freed from under us. So we are done. */
874                 goto out;
875         }
876
877         if (unlikely(PageTransHuge(page)))
878                 if (unlikely(split_huge_page(page)))
879                         goto out;
880
881         rc = __unmap_and_move(page, newpage, force, mode);
882
883         if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
884                 /*
885                  * A ballooned page has been migrated already.
886                  * Now, it's the time to wrap-up counters,
887                  * handle the page back to Buddy and return.
888                  */
889                 dec_zone_page_state(page, NR_ISOLATED_ANON +
890                                     page_is_file_cache(page));
891                 balloon_page_free(page);
892                 return MIGRATEPAGE_SUCCESS;
893         }
894 out:
895         if (rc != -EAGAIN) {
896                 /*
897                  * A page that has been migrated has all references
898                  * removed and will be freed. A page that has not been
899                  * migrated will have kepts its references and be
900                  * restored.
901                  */
902                 list_del(&page->lru);
903                 dec_zone_page_state(page, NR_ISOLATED_ANON +
904                                     page_is_file_cache(page));
905                 putback_lru_page(page);
906         }
907         /*
908          * Move the new page to the LRU. If migration was not successful
909          * then this will free the page.
910          */
911         putback_lru_page(newpage);
912         if (result) {
913                 if (rc)
914                         *result = rc;
915                 else
916                         *result = page_to_nid(newpage);
917         }
918         return rc;
919 }

The function unmaps first, then moves (line 881); the move is in fact a copy (__unmap_and_move->move_to_new_page->migrate_page->migrate_page_copy). Afterwards the migrated-from page is released back to the buddy system (line 905).

 

At this point the preparation for allocating a run of contiguous free physical memory is complete (the free pages end up collected on the cc->freepages list), but the pages are still inside the buddy system and must be taken off it. That removal does not go through the generic alloc_pages path; it is done by hand (dma_alloc_from_contiguous->alloc_contig_range->isolate_freepages_range). In the process, contiguous high-order blocks are split up (from order N down to order 0), the head page of each block is unhooked from the buddy free list via page->lru, the struct page of every page in the run is set up (split_free_page), and the pageblock's migratetype in zone->pageblock_flags is set to MIGRATE_CMA.

4.    Release

Release is straightforward. As with allocation, the entry point is the DMA API, e.g. dma_free_coherent, which ultimately reaches CMA's release function, free_contig_range.
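Between dma_free_coherent and free_contig_range sits dma_release_from_contiguous, which validates the range and clears the bitmap bits; in v3.10 it is roughly the following (reproduced from memory, details may differ):

bool dma_release_from_contiguous(struct device *dev, struct page *pages,
                                 int count)
{
        struct cma *cma = dev_get_cma_area(dev);
        unsigned long pfn;

        if (!cma || !pages)
                return false;

        pfn = page_to_pfn(pages);
        if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
                return false;           /* not from this CMA area */

        mutex_lock(&cma_mutex);
        bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
        free_contig_range(pfn, count);  /* give pages back to buddy */
        mutex_unlock(&cma_mutex);

        return true;
}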

6035 void free_contig_range(unsigned long pfn, unsigned nr_pages)
6036 {
6037         unsigned int count = 0;
6038
6039         for (; nr_pages--; pfn++) {
6040                 struct page *page = pfn_to_page(pfn);
6041
6042                 count += page_count(page) != 1;
6043                 __free_page(page);
6044         }
6045         WARN(count != 0, "%d pages are still in use!\n", count);
6046 }

It simply walks the range page by page and releases each page back to the buddy system (lines 6039-6044); the WARN at line 6045 flags pages that were still referenced (page_count != 1) at release time.

 

5.    Summary

CMA avoids the usual downside of reserving memory for a driver, namely a permanently smaller pool of usable system memory. While the driver is idle, the CMA pages can be lent to user processes; when the driver needs them for DMA transfers, the borrowed pages are reclaimed or migrated to hand the memory back to the driver.

 

