Regular File Read/Write and Read-Ahead (Repost)

Source: Internet | Published by: 互联网创业知乎 | Editor: 程序博客网 | Date: 2024/06/13 23:12

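7) Regular file read/write and read-ahead
generic_file_read handles reads of regular files (the read system call) for
every filesystem that uses the page cache.
The read system call is in fs/read_write.c: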
asmlinkage ssize_t sys_read(unsigned int fd, char * buf, size_t count)
sys_read calls the read method supplied by the filesystem. Taking ext2 as the example, that is:
/*
* We have mostly NULL's here: the current defaults are ok for
* the ext2 filesystem.
*/
struct file_operations ext2_file_operations = {
llseek: ext2_file_lseek,
read: generic_file_read,
write: generic_file_write,
ioctl: ext2_ioctl,
mmap: generic_file_mmap,
open: ext2_open_file,
release: ext2_release_file,
fsync: ext2_sync_file,
};
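In general, file reads go through generic_file_read. It sets up a read
descriptor and hands it to do_generic_file_read, which does the real work.
The call also passes a function pointer, file_read_actor, whose job is to
copy data at a given offset and length within a page to user space.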
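The split between the descriptor and the actor can be pictured with a small user-space sketch. This is only an illustration under stated assumptions: read_desc, copy_actor and read_pages are hypothetical stand-ins for read_descriptor_t, file_read_actor and do_generic_file_read, the "file" is a plain memory buffer, the page size is assumed to be 4 KiB, and memcpy stands in for the copy to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Hypothetical stand-in for read_descriptor_t: progress of one read call. */
struct read_desc {
    size_t written;   /* bytes copied so far */
    size_t count;     /* bytes still wanted */
    char  *buf;       /* destination cursor */
    int    error;     /* first error seen, 0 if none */
};

/* An actor consumes up to `size` bytes at `offset` inside one page and
 * returns how many bytes it actually used. */
typedef size_t (*read_actor_fn)(struct read_desc *desc, const char *page,
                                size_t offset, size_t size);

/* Hypothetical stand-in for file_read_actor (memcpy instead of a user copy). */
static size_t copy_actor(struct read_desc *desc, const char *page,
                         size_t offset, size_t size)
{
    assert(offset + size <= SKETCH_PAGE_SIZE);
    if (size > desc->count)
        size = desc->count;
    memcpy(desc->buf, page + offset, size);
    desc->buf     += size;
    desc->written += size;
    desc->count   -= size;
    return size;
}

/* Walk the pages of an in-memory "file" and hand each piece to the actor. */
static void read_pages(const char *file, size_t file_size, size_t pos,
                       struct read_desc *desc, read_actor_fn actor)
{
    while (desc->count > 0 && pos < file_size) {
        size_t index  = pos / SKETCH_PAGE_SIZE;     /* page number */
        size_t offset = pos % SKETCH_PAGE_SIZE;     /* offset in that page */
        size_t nr     = SKETCH_PAGE_SIZE - offset;  /* bytes left in the page */
        size_t used;

        if (nr > file_size - pos)                   /* last, partial page */
            nr = file_size - pos;
        used = actor(desc, file + index * SKETCH_PAGE_SIZE, offset, nr);
        if (used == 0)
            break;
        pos += used;
    }
}
```

The page walker never touches the destination buffer itself; only the actor knows where the bytes go, which is why a different actor can reuse the same loop.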
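First, the problems do_generic_file_read has to deal with:
1) Page cache: regular files are cached in the kernel's page cache, so Linux
   treats a file as a sequence of logical pages of page size when reading or
   writing it. Reading a file means converting the position and length the
   user asked for into logical pages, finding the corresponding pages in the
   page cache, and copying their contents to the user buffer. If that part
   of the file is not cached, it must be read from disk into a memory page
   in the filesystem's own way, and the page is then added to the page cache.
2) The previous item cuts the file stream into pages. block_read_full_page
   (usually this function) then cuts each page into block numbers that are
   linear within the file, and the concrete filesystem finally maps those
   file-linear blocks to blocks that are linear on the disk (hardware block
   numbers?).
3) Read-ahead: while the user reads a file, the kernel tries hard to guess
   the user's intent and to have the data ready before it is needed. Disk
   I/O can then be started early and proceed in parallel via DMA, and
   batched I/O improves throughput. The kernel's read-ahead should be very
   effective for sequential access.
4) Isolating how each filesystem reads file contents. Given the inode
   associated with the file, contents are read through the function pointer
   mapping->a_ops->readpage. ext2 is a concrete example: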
    struct address_space_operations ext2_aops = {
         readpage: ext2_readpage,
         writepage: ext2_writepage,
         sync_page: block_sync_page,
         prepare_write: ext2_prepare_write,
         commit_write: generic_commit_write,
         bmap: ext2_bmap
   };
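   ext2_readpage simply calls block_read_full_page(page, ext2_get_block),
   which converts the page index (linear within the file) into logical
   blocks (also linear within the file). ext2_get_block(*inode, iblock,
   *bh_result, create) then maps the file's logical block number to the
   block device's logical block number (block numbers linear on the block
   device), and the request is finally submitted to the device driver to
   read the given physical block. (The driver converts the device block
   number into a sector number.. ^_^) This is only a brief look at how file
   pages are read and written; it will be examined more closely later, when
   the buffer-related files are analyzed.
{[While writing this, something came up and I lost two weeks. I also had a look at devfs along the way.. go on]}
   With that in place, the logic is clear when we go through do_generic_file_read in detail.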
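The first two translation steps in the list above (byte position to page index and offset, then page index to file-logical block numbers) can be sketched as follows. The helper names are hypothetical, a 4 KiB page is assumed, and block_bits is log2 of the filesystem block size (10 for the 1 KiB blocks assumed in the test). The last step, from file block to device block, is filesystem specific (ext2_get_block for ext2) and is not modeled here.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

struct page_pos {
    unsigned long index;    /* which page of the file */
    unsigned long offset;   /* byte offset inside that page */
};

/* Byte position in the file -> (page index, offset within page). */
static struct page_pos pos_to_page(unsigned long long pos)
{
    struct page_pos p;
    p.index  = (unsigned long)(pos >> SKETCH_PAGE_SHIFT);
    p.offset = (unsigned long)(pos & (SKETCH_PAGE_SIZE - 1));
    return p;
}

/* Number of filesystem blocks that make up one page. */
static unsigned long blocks_per_page(unsigned int block_bits)
{
    assert(block_bits <= SKETCH_PAGE_SHIFT);
    return 1UL << (SKETCH_PAGE_SHIFT - block_bits);
}

/* First file-logical block covered by page `index`. */
static unsigned long page_first_block(unsigned long index,
                                      unsigned int block_bits)
{
    assert(block_bits <= SKETCH_PAGE_SHIFT);
    return index << (SKETCH_PAGE_SHIFT - block_bits);
}
```

With 1 KiB blocks, a read at byte 8292 lands in page 2 at offset 100, and that page covers file blocks 8 through 11.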
/*
* This is a generic file read routine, and uses the
* inode->i_op->readpage() function for the actual low-level
* stuff.
*
* This is really ugly. But the goto's actually try to clarify some
* of the logic when it comes to error handling etc.
*/
void do_generic_file_read(struct file * filp, loff_t *ppos, read_descriptor_t * desc,
read_actor_t actor)
{
struct inode *inode = filp->f_dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
unsigned long index, offset;
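struct page *cached_page;  /* page allocated when the target is not in the page cache; it may
                           * go unused, because we may wait while taking the lock and
                           * another execution flow may add the page first.
                           */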
int reada_ok;
int error;
int max_readahead = get_max_readahead(inode);
/*
   * Convert the position in the byte stream into an index in the file's linear page stream
   */
cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
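/*
* Check whether read-ahead is still valid and adjust the read-ahead size in time.
* If read-ahead has never been done or has been reset, adjusting read-ahead max is what initializes it.
*/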
/*
* If the current position is outside the previous read-ahead window,
* we reset the current read-ahead context and set read ahead max to zero
* (will be set to just needed value later),
* otherwise, we assume that the file accesses are sequential enough to
* continue read-ahead.
*/
if (index > filp->f_raend || index + filp->f_rawin < filp->f_raend) {
                        /* i.e. index < filp->f_raend - filp->f_rawin */
/* If the read falls outside the read-ahead window, recompute the read-ahead size and start position */
reada_ok = 0;
filp->f_raend = 0;
filp->f_ralen = 0;
filp->f_ramax = 0;
filp->f_rawin = 0;
} else {
reada_ok = 1;
}
/*
* Adjust the current value of read-ahead max.
* If the read operation stay in the first half page, force no readahead.
* Otherwise try to increase read ahead max just enough to do the read request.
* Then, at least MIN_READAHEAD if read ahead is ok,
* and at most MAX_READAHEAD in all cases.
*/
if (!index && offset + desc->count <= (PAGE_CACHE_SIZE >> 1)) {
/* Reading within the first half page of the file: no read-ahead */
filp->f_ramax = 0;
} else {
unsigned long needed;
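      /* Number of pages that must be read in; note that *ppos is at offset `offset` within page `index` */
needed = ((offset + desc->count) >> PAGE_CACHE_SHIFT) + 1;
if (filp->f_ramax < needed)
filp->f_ramax = needed; /* read-ahead must at least cover this request */
if (reada_ok && filp->f_ramax < MIN_READAHEAD)
filp->f_ramax = MIN_READAHEAD;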
if (filp->f_ramax > max_readahead)
filp->f_ramax = max_readahead;
}
  /*
* Read every page the user asked for
*/
for (;;) {
struct page *page, **hash;
unsigned long end_index, nr;
      /* nr: number of bytes to read from this page */
end_index = inode->i_size >> PAGE_CACHE_SHIFT;
if (index > end_index)
break;
nr = PAGE_CACHE_SIZE;
if (index == end_index) {
nr = inode->i_size & ~PAGE_CACHE_MASK;
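if (nr <= offset)
break;
}
nr = nr - offset;
/*
* Try to find the data in the page cache..
*/
hash = page_hash(mapping, index);
spin_lock(&pagecache_lock);
page = __find_page_nolock(mapping, index, *hash);
if (!page)
goto no_cached_page;
found_page:
page_cache_get(page);
spin_unlock(&pagecache_lock);
if (!Page_Uptodate(page))
goto page_not_up_to_date;
generic_file_readahead(reada_ok, filp, inode, page);
page_ok:
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
if (mapping->i_mmap_shared != NULL)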
flush_dcache_page(page);
/*
   * Ok, we have the page, and it's up-to-date, so
   * now we can copy it to user space...
   *
   * The actor routine returns how many bytes were actually used..
   * NOTE! This may not be the same as how much of a user buffer
   * we filled up (we may be padding etc), so we can only update
   * "pos" here (the actor routine has to update the user buffer
   * pointers and the remaining count).
   */
nr = actor(desc, page, offset, nr);
offset += nr;
/* Compute the next page and offset to read */
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;

page_cache_release(page);
if (nr && desc->count)  /* more to read */
continue;
break;  /* read finished */
   /*
      * End of the main flow of the for loop
      */
/*
* The page does not hold valid data
*/
/*
* Ok, the page was not immediately readable, so let's try to read ahead while we're at it..
*/
page_not_up_to_date:
generic_file_readahead(reada_ok, filp, inode, page);
if (Page_Uptodate(page))
goto page_ok;
/* Get exclusive access to the page ... */
lock_page(page);
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
UnlockPage(page);
page_cache_release(page);
continue;
}
/* Did somebody else fill it already? */
if (Page_Uptodate(page)) {
UnlockPage(page);
goto page_ok;
}
readpage: /* both the not-up-to-date case and the not-in-page-cache case (no_cached_page) may end up reading the page here */
/* ... and start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
if (!error) {
if (Page_Uptodate(page))
goto page_ok;
/* Again, try some read-ahead while waiting for the page to finish.. */
generic_file_readahead(reada_ok, filp, inode, page);
wait_on_page(page);
if (Page_Uptodate(page))
goto page_ok;
error = -EIO;
}
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
page_cache_release(page);
break;
/*
* The page was not found in the page cache, so one has to be allocated
*/
no_cached_page:
/*
   * Ok, it wasn't cached, so we need to create a new
   * page..
   *
   * We get here with the page cache lock held.
   */
if (!cached_page) {
spin_unlock(&pagecache_lock);
cached_page = page_cache_alloc();
if (!cached_page) {
desc->error = -ENOMEM;
break;
}
/*
   * Somebody may have added the page while we
   * dropped the page cache lock. Check for that.
   */
spin_lock(&pagecache_lock);
page = __find_page_nolock(mapping, index, *hash);
if (page)
goto found_page;
}
/*
   * Ok, add the new page to the hash-queues...
   */
page = cached_page;
__add_to_page_cache(page, mapping, index, hash);
spin_unlock(&pagecache_lock);
cached_page = NULL;
goto readpage;
} /*end for*/
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
filp->f_reada = 1;
if (cached_page)
page_cache_free(cached_page);
UPDATE_ATIME(inode);
}
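The analysis of the function is in the comments above. The remaining topic
is read-ahead. do_generic_file_read is naturally the best place to trigger
read-ahead on a file: it sets up the read-ahead context (I keep wondering
how best to translate "context") and checks whether read-ahead is still valid.
To sort out the read-ahead variables, we walk through do_generic_file_read
three times: the first read of a file; a second, sequential read that hits
the read-ahead window; and a third read that falls outside the window. We
look at how it cooperates with generic_file_readahead.
Assumptions:
   1) The read does not start at byte 0; say it starts at 8k.
   2) Locking is fast during the read, and the I/O does not complete quickly
      (this should be the usual case; an IDE disk is not that fast).
First read of the file (assume the page is not in the page cache):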
+----do_generic_file_read()
   {
         .......
         if (index > filp->f_raend ||....) {
            reada_ok = 0; // reading at 8k, so outside the (still empty) read-ahead context
            ....
         } else { }

         if (!index && offset ...) {
         }
         else {
               unsigned long needed;

               needed = ....;
               if (filp->f_ramax < needed)
                     filp->f_ramax = needed;  // f_ramax initialized
         }

      readpage:
         This is the first read, so the page is not in the page cache and
         must be read from the hard disk; the page is locked.
         if (!error) {
            if (Page_Uptodate(page))
                  goto page_ok; // assuming the read has not finished yet is
                  // reasonable; it cannot be that fast
            // so the page is locked when read-ahead runs, and reada_ok is 0
            generic_file_readahead(reada_ok, filp, inode, page);
            wait_on_page(page);
            if (Page_Uptodate(page))
                  goto page_ok;
            error = -EIO;
         }
   }

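+--generic_file_readahead()
   {
         raend = filp->f_raend; /* = 0 */
         max_ahead = 0;  /* number of pages to start I/O on this time */

         if (PageLocked(page)) { // first read of the file, so filp->f_ralen is 0
               if (!filp->f_ralen || index >= raend ||
                   index + filp->f_rawin < raend) {
                     raend = index;
                     if (raend < end_index)
                           max_ahead = filp->f_ramax; // read ahead filp->f_ramax pages this time;
                              // already initialized in do_generic_file_read
                     filp->f_rawin = 0; // window is 0: nothing read ahead yet (or read-ahead is being rebuilt)
                     filp->f_ralen = 1; // the previous "read-ahead" covered 1 page
                     if (!max_ahead) {
                           filp->f_raend  = index + filp->f_ralen; /* first page beyond the last read-ahead window */
                           filp->f_rawin += filp->f_ralen; /* total of consecutive valid read-ahead pages */
                     }
               }
         } else if (reada_ok ...) {
         }
         ahead = 0; /* number of pages read ahead this time */
         while (ahead < max_ahead) {
               ahead++;
               if ((raend + ahead) >= end_index)
                     break;
               if (page_cache_read(filp, raend + ahead) < 0)
                     break;
         }
         if (ahead) {
            filp->f_ralen += ahead; // f_ralen is the length of the last read-ahead; record it here
            filp->f_rawin += filp->f_ralen; // f_rawin is the total of all consecutive valid read-ahead
            filp->f_raend = raend + ahead + 1; // f_raend is the first page beyond the read-ahead window
            filp->f_ramax += filp->f_ramax; // read-ahead is working, so double the size for next time
            .....
         }
   }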
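Analysis: the first read is of page 2 at offset 0, and every read-ahead field
of filp is 0. do_generic_file_read sets reada_ok to 0 and sets filp->f_ramax
to the number of pages the user is reading (subject to the upper bound).
generic_file_readahead sets up the read-ahead record for this first read and
reads ahead a certain number of pages.
Second read of the file: read-ahead was done last time; assume the page is
already in the page cache and the read is sequential, so it hits the
read-ahead window.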
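+----do_generic_file_read()
   {
   .......
   if (index > filp->f_raend ||....) {..}
   else { // hit the read-ahead window
      reada_ok = 1;
   }

   if (!index && offset ...) { /* reading the first half page of the file: no read-ahead */
      // we are long past the first half page
   }
   else {
         unsigned long needed;

            needed = ....;
            // assume the previous read-ahead size was already large enough, so f_ramax
            // is not reset here; it is twice the previous read amount
            if (filp->f_ramax < needed)
                  filp->f_ramax = needed;
   }

     for (;;) {
     // we have assumed the page is in the page cache
     found_page:
         ....
   if (!Page_Uptodate(page))
      goto page_not_up_to_date; /* assume the read-ahead has completed (it makes no difference if it has not) */
      // so the page is not locked when read-ahead runs, and reada_ok is 1
      generic_file_readahead(reada_ok, filp, inode, page);
      ............
   }
  }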
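  // reada_ok = 1 means this read hit the read-ahead window (but not necessarily the most recent read-ahead)
+--generic_file_readahead()
   {
   raend = filp->f_raend; /* no longer 0: set by the first read */
      max_ahead = 0;  /* number of pages to start I/O on this time */

   if (PageLocked(page)) {
      // not locked this time ^_^
   } else if (reada_ok && filp->f_ramax && raend >= 1 &&
         index <= raend && index + filp->f_ralen >= raend) {
   /* Hit the read-ahead window, and within the pages of the last read-ahead:
      * the user really is pressing hard on our heels. If this read were not like
      * that, no further read-ahead would be done; for now just assume it is.
      */
      /* the page is unlocked: the read may have completed, or may not have started yet ---> */
         raend -= 1; /* see the kernel comment: keeps the same I/O max size as synchronous read-ahead */
         if (raend < end_index)
               max_ahead = filp->f_ramax + 1;
         if (max_ahead) {
         filp->f_rawin = filp->f_ralen;
         filp->f_ralen = 0; /* clear the last read-ahead length (the "last read-ahead" window) */
         reada_ok    = 2; /* ---> so we may need to push things along and start the read soon */
         }
      }

   ahead = 0; /* number of pages read ahead this time */
   while (ahead < max_ahead) {
         ahead++;
         if ((raend + ahead) >= end_index)
               break;
         if (page_cache_read(filp, raend + ahead) < 0)
               break;
   }
   if (ahead) {
      filp->f_ralen += ahead; // f_ralen is the length of the last read-ahead; record it here
      filp->f_rawin += filp->f_ralen; // f_rawin is the total of all consecutive valid read-ahead
      filp->f_raend = raend + ahead + 1; // f_raend is the first page beyond the read-ahead window
      filp->f_ramax += filp->f_ramax; // read-ahead is working, so double the size for next time
            .....
   }
}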
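Analysis: if the second read hits the pages of the last read-ahead, read-ahead
is proving useful and the access is very likely sequential, so read-ahead is
done again (twice the previous amount) and the read-ahead size is doubled
once more (subject to the upper bound, of course).
Third read: misses the read-ahead window. This is similar to the first read,
so it is not listed again.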
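The three reads can also be replayed with a small user-space model of the read-ahead bookkeeping. This is a sketch that follows the walkthrough above, not the kernel code itself: struct ra_ctx and both functions are hypothetical names, the page size and the MIN/MAX read-ahead bounds are assumed values, the caller says whether the page is locked, and every read-ahead I/O is assumed to succeed.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT    12   /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE     (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_MIN_READAHEAD 3    /* assumed lower bound, in pages */
#define SKETCH_MAX_READAHEAD 31   /* assumed upper bound, in pages */

/* Hypothetical mirror of the four per-file read-ahead fields. */
struct ra_ctx {
    unsigned long raend;  /* first page beyond the read-ahead window */
    unsigned long ralen;  /* length of the most recent read-ahead */
    unsigned long ramax;  /* current read-ahead maximum */
    unsigned long rawin;  /* total length of the current window */
};

/* Window check plus ramax adjustment, as at the top of the read loop.
 * Returns reada_ok (1 if the read hit the window, 0 if it was reset). */
static int ra_begin_read(struct ra_ctx *c, unsigned long index,
                         unsigned long offset, unsigned long count)
{
    int reada_ok = 1;
    unsigned long needed;

    if (index > c->raend || index + c->rawin < c->raend) {
        c->raend = c->ralen = c->ramax = c->rawin = 0;
        reada_ok = 0;
    }
    if (!index && offset + count <= (SKETCH_PAGE_SIZE >> 1)) {
        c->ramax = 0;                     /* first half page: no read-ahead */
        return reada_ok;
    }
    needed = ((offset + count) >> SKETCH_PAGE_SHIFT) + 1;
    if (c->ramax < needed)
        c->ramax = needed;
    if (reada_ok && c->ramax < SKETCH_MIN_READAHEAD)
        c->ramax = SKETCH_MIN_READAHEAD;
    if (c->ramax > SKETCH_MAX_READAHEAD)
        c->ramax = SKETCH_MAX_READAHEAD;
    return reada_ok;
}

/* Read-ahead decision and context update. Returns pages read ahead. */
static unsigned long ra_readahead(struct ra_ctx *c, int reada_ok,
                                  unsigned long index, int page_locked,
                                  unsigned long end_index)
{
    unsigned long raend = c->raend, max_ahead = 0, ahead = 0;

    assert(c != 0);
    if (page_locked) {                    /* synchronous case */
        if (!c->ralen || index >= raend || index + c->rawin < raend) {
            raend = index;
            if (raend < end_index)
                max_ahead = c->ramax;
            c->rawin = 0;
            c->ralen = 1;
            if (!max_ahead) {
                c->raend = index + c->ralen;
                c->rawin += c->ralen;
            }
        }
    } else if (reada_ok && c->ramax && raend >= 1 &&
               index <= raend && index + c->ralen >= raend) {
        raend -= 1;                       /* asynchronous case */
        if (raend < end_index)
            max_ahead = c->ramax + 1;
        if (max_ahead) {
            c->rawin = c->ralen;
            c->ralen = 0;
        }
    }
    while (ahead < max_ahead) {           /* every I/O assumed to succeed */
        ahead++;
        if (raend + ahead >= end_index)
            break;
    }
    if (ahead) {
        c->ralen += ahead;
        c->rawin += c->ralen;
        c->raend = raend + ahead + 1;
        c->ramax += c->ramax;             /* double for next time */
        if (c->ramax > SKETCH_MAX_READAHEAD)
            c->ramax = SKETCH_MAX_READAHEAD;
    }
    return ahead;
}
```

Under these assumptions, a one-page read at page 2 with the page locked reads ahead 2 pages and leaves raend=5, ralen=3, rawin=3, ramax=4. The following read at page 3 with the page unlocked reads ahead 5 pages and leaves raend=10, ralen=5, rawin=8, ramax=8. A later read at page 500 resets the context.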

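  Read-ahead comes in two kinds: synchronous and asynchronous. Reading data
from disk by DMA is always asynchronous, so "synchronous" here should mean
that the read-ahead happens together with the user's own read: I/O on the
current page has already started and the page is locked.
  During asynchronous read-ahead, run_task_queue(&tq_disk) is called. What
does it actually do? In drivers/block/ll_rw_blk.c, generic_plug_device puts
the request_queue_t on the task queue tq_disk. What requests sit in the block
driver's task queue? Our reads and writes, of course. To confirm, here is a
function from the same file: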
void blk_init_queue(request_queue_t * q, request_fn_proc * rfn)
{
INIT_LIST_HEAD(&q->queue_head);
INIT_LIST_HEAD(&q->request_freelist[READ]);
INIT_LIST_HEAD(&q->request_freelist[WRITE]);
elevator_init(&q->elevator, ELEVATOR_LINUS);
blk_init_free_list(q);
q->request_fn      = rfn;  /*note 0*/
q->back_merge_fn    = ll_back_merge_fn;
q->front_merge_fn       = ll_front_merge_fn;
q->merge_requests_fn = ll_merge_requests_fn;
q->make_request_fn = __make_request;
q->plug_tq.sync = 0;
q->plug_tq.routine = &generic_unplug_device; /*note 1*/
q->plug_tq.data = q;
q->plugged         = 0;
/*
   * These booleans describe the queue properties.  We set the
   * default (and most common) values here.  Other drivers can
   * use the appropriate functions to alter the queue properties.
   * as appropriate.
   */
q->plug_device_fn = generic_plug_device; /*note 2*/
q->head_active = 1;
}
  It initializes the block driver's request queue. For IDE, see drivers/ide/ide-probe.c:
static void ide_init_queue(ide_drive_t *drive)
{
request_queue_t *q = &drive->queue;
q->queuedata = HWGROUP(drive);
blk_init_queue(q, do_ide_request);
}
In the IDE request queue:
         q->request_fn = do_ide_request,
         q->plug_tq.routine = &generic_unplug_device;
         q->plug_device_fn = generic_plug_device;
         q->make_request_fn = __make_request;
First we request a read:
  submit_bh -> generic_make_request -> q->make_request_fn (i.e. __make_request):
__make_request()
{....
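if (list_empty(head)) { // if the driver has no other pending requests,
// plug the queue onto the task queue, so that I/O starts only after a run of
// requests has been queued and consecutive requests can be merged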
q->plug_device_fn(q, bh->b_rdev); /* is atomic */ /*generic_plug_device*/
goto get_rq;
}
....
add_request -> puts the read/write request on q.
out:
if (!q->plugged) /* once plugged, request_fn is no longer called directly */
(q->request_fn)(q);  /* do_ide_request*/

}
Then, when we call run_task_queue(&tq_disk) directly:
run_task_queue -> __run_task_queue -> tq_disk->routine (i.e. generic_unplug_device)
-> __generic_unplug_device -> q->request_fn (i.e. do_ide_request).
  With this worked out, the comment below makes sense:
generic_file_readahead ()
{
  ..........
/*
* .............
* If we tried to read ahead asynchronously,
* Try to force unplug of the device in order to start an asynchronous
* read IO request.
* ........
*/
if (ahead) {
if (reada_ok == 2) { /* force an unplug so the asynchronous I/O really starts */
run_task_queue(&tq_disk);
}
....
}
}
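The plug/unplug behaviour traced above can be illustrated with a toy queue. This is a simplified sketch, not the block layer: struct sketch_queue and the three functions are hypothetical, requests are only counted rather than stored, and the task queue is reduced to a direct call to the unplug function.

```c
#include <assert.h>

/* Hypothetical toy request queue. */
struct sketch_queue {
    int plugged;       /* 1 while requests are being batched */
    int pending;       /* requests queued but not yet dispatched */
    int dispatches;    /* how many times request_fn has run */
    int completed;     /* requests handed to the "driver" so far */
    void (*request_fn)(struct sketch_queue *q);
};

/* Stand-in for the driver's request function: takes the whole batch. */
static void sketch_request_fn(struct sketch_queue *q)
{
    q->dispatches++;
    q->completed += q->pending;
    q->pending = 0;
}

/* Queue one request. An empty queue gets plugged first, so a run of
 * requests can pile up (and be merged) before any I/O starts. */
static void sketch_make_request(struct sketch_queue *q)
{
    assert(q->request_fn != 0);
    if (q->pending == 0)
        q->plugged = 1;          /* plays the role of plug_device_fn */
    q->pending++;                /* plays the role of add_request */
    if (!q->plugged)
        q->request_fn(q);        /* not plugged: dispatch right away */
}

/* What running the task queue boils down to for one queue. */
static void sketch_unplug(struct sketch_queue *q)
{
    if (q->plugged) {
        q->plugged = 0;
        if (q->pending)
            q->request_fn(q);    /* start I/O for the whole batch */
    }
}
```

Three requests queued back to back cause no dispatch at all until the unplug, at which point the driver function runs once and sees all three. That is why the asynchronous read-ahead path forces an unplug: without it, the pages it queued would sit in the plugged queue until something else ran the task queue.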