Linux I/O: The System Cache (pdflush & dirty pages) and Related Topics

Source: Internet | Published by: 工程类软件 | Editor: 程序博客网 | Time: 2024/06/07 11:13

[Original article]

http://www.phpfans.net/article/htmls/201010/MzEwNzAx.html

Further reading:

Thoughts prompted by cgroup IOPS limits on users who share a file system:

http://blog.163.com/digoal@126/blog/static/163877040201571403648184/

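Kernel parameters related to the system cache (two more, dirty_background_bytes and dirty_bytes, take a byte count instead and otherwise mean much the same as the ratio versions):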

1. /proc/sys/vm/dirty_background_ratio

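This file sets the percentage of system memory that dirty data may occupy before the pdflush thread is triggered to write dirty data back to disk. (The kernel documentation defines the base as available memory, that is free plus reclaimable pages, rather than all installed RAM.)

Default: 10

When a user process calls write and the amount of dirty data in the system exceeds this threshold (or dirty_background_bytes), pdflush is woken to write dirty data, but the write call returns immediately without waiting. pdflush keeps flushing until the amount of dirty pages drops below the threshold.

A cgroup limit on the user process's IOPS does not matter here, because the process is not the one doing the writeback.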

2. /proc/sys/vm/dirty_expire_centisecs

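This file sets how long dirty data may stay in memory. Data that has been dirty for longer than this is written back to disk the next time pdflush runs.

Default: 3000 (in units of 1/100 second, i.e. 30 seconds)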

3. /proc/sys/vm/dirty_ratio

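This file sets the percentage of system memory that dirty data generated by processes may reach before the writing process itself has to write dirty data back to disk.

Default: 40 (on the kernel described here; the sysctl output quoted later in this article shows 20)

When a user process calls write and the amount of dirty data in the system exceeds this threshold (or dirty_bytes), the process must flush dirty data to disk itself, and write returns only once the amount has dropped below the threshold.

Note that if a cgroup limits the process's IOPS at this point, the result is painful: the throttled process is the one that has to do the flushing.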

4. /proc/sys/vm/dirty_writeback_centisecs

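This file sets the wakeup interval of pdflush, which periodically writes back dirty data that is older than dirty_expire_centisecs.

Default: 500 (in units of 1/100 second, i.e. 5 seconds)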
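The four parameters above (plus the two *_bytes variants) can be inspected from a program as well as from the shell. The following is a minimal sketch, assuming a Linux system that exposes them under /proc/sys/vm; the function name read_dirty_tunables and the base argument are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Names of the writeback tunables discussed in this article.
DIRTY_TUNABLES = (
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_dirty_tunables(base="/proc/sys/vm"):
    """Return {name: int} for every tunable file that exists under base."""
    values = {}
    for name in DIRTY_TUNABLES:
        path = Path(base) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_dirty_tunables().items():
        print(f"vm.{name} = {value}")
```

Taking the directory as a parameter keeps the function usable against a copy of the files, for example when comparing settings collected from several machines.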

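The system generally writes back dirty pages in three situations:

1. Periodically: /proc/sys/vm/dirty_writeback_centisecs sets how often the writeback thread is started. A writeback thread started by this timer only writes pages that have been dirty for longer than (/proc/sys/vm/dirty_expire_centisecs / 100) seconds; the default value is 3000, i.e. 30 seconds. dirty_writeback_centisecs is normally 500, i.e. 5 seconds, so by default the writeback thread starts every 5 seconds and writes back pages that have been dirty for more than 30 seconds. Note that a thread started this way only writes expired dirty pages, never pages that have not yet expired. Both values can be changed through /proc; for details see the kernel function wb_kupdate.

2. When memory is low: not all dirty pages are written to disk. Instead, roughly 1024 pages are written at a time until enough free pages are available.

3. When a write finds that dirty pages exceed a certain ratio:

When the share of system memory occupied by dirty pages exceeds /proc/sys/vm/dirty_background_ratio, the write system call wakes pdflush to write back dirty pages until the share drops below /proc/sys/vm/dirty_background_ratio. The write call itself is not blocked and returns immediately.

When the share of system memory occupied by dirty pages exceeds /proc/sys/vm/dirty_ratio, the write system call blocks and writes back dirty pages itself until the share drops below /proc/sys/vm/dirty_ratio.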
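The two thresholds in case 3 can be summarized as a small decision function. This is only a sketch of the rule as described above, not kernel code: the function name and return labels are made up for illustration, the defaults are the values quoted in this article, and the real kernel measures against available rather than total memory.

```python
def writeback_action(dirty_pages, total_pages, background_ratio=10, dirty_ratio=40):
    """Classify what a write() call would trigger, per the two thresholds.

    Returns "block" when the writer must flush synchronously,
    "background" when only the background flusher is woken,
    and "none" when neither threshold is exceeded.
    """
    dirty_percent = 100.0 * dirty_pages / total_pages
    if dirty_percent > dirty_ratio:
        return "block"
    if dirty_percent > background_ratio:
        return "background"
    return "none"
```

For example, with 100 pages of memory, 15 dirty pages give "background" and 45 dirty pages give "block". The gap between the two ratios is the room the background flusher has to catch up before writers start to stall.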

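Lessons from a data-heavy project:

1 If the write volume is very large, do not count on the system cache's automatic writeback; it is better to call fsync or sync from the application. If data is written faster than the cache can flush it automatically, the dirty-page ratio can exceed /proc/sys/vm/dirty_ratio, at which point the system blocks subsequent writes. Such a stall can last as long as 5 minutes, which our application could not tolerate. One recommended approach is therefore to call fsync at suitable points in the application.

2 For performance-critical paths, do not rely on the system cache. If the performance requirements are high, implement a cache in the application layer, because the system cache is too exposed to outside influence and can be evicted at any moment.

3 While designing logic, we found one requirement that the system cache fits very well. The very long threads (高楼贴) handled by logic are complicated to cache in the application layer, yet there are very few of them. Requests for these threads can rely on the system cache, provided it works together with the application-layer cache: the application cache absorbs the vast majority of requests, those for ordinary threads, so that nearly all of the program's disk I/O is for the long threads. In this situation the system cache works very well.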
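Point 1 above recommends calling fsync from the application at suitable moments. One way to do that is to sync after every fixed amount of written data, so that dirty pages never pile up toward dirty_ratio. The sketch below assumes this policy; the function name and the 64 MB default are illustrative choices, not values from the original project.

```python
import os


def write_with_periodic_fsync(path, chunks, sync_every_bytes=64 * 1024 * 1024):
    """Write chunks to path, forcing data to disk every sync_every_bytes.

    Returns the number of fsync calls made (including the final one).
    """
    syncs = 0
    pending = 0  # bytes written since the last fsync
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            pending += len(chunk)
            if pending >= sync_every_bytes:
                f.flush()             # move Python's buffer into the page cache
                os.fsync(f.fileno())  # ask the kernel to write this file's dirty pages
                syncs += 1
                pending = 0
        # Final sync so nothing is left only in the cache.
        f.flush()
        os.fsync(f.fileno())
        syncs += 1
    return syncs
```

The trade-off is that each fsync blocks the caller for a short, bounded time, in exchange for avoiding one long unpredictable stall.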

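Disk readahead:

Two excerpts on readahead follow:

Overview of the readahead algorithm

1. Sequentiality detection

To keep the readahead hit rate high, Linux only performs readahead for sequential reads. The kernel decides whether a read() is sequential by checking the following two conditions:

◆ This is the first read after the file was opened, and it reads from the start of the file;

◆ The current read request is contiguous, in file offset, with the previous (recorded) read request.

If neither condition holds, the read is treated as random. Any random read ends the current sequential stream and therefore stops readahead (rather than shrinking the readahead size). Note that sequentiality here refers to offsets within the file, not to contiguity of physical disk sectors. This is a simplification on Linux's part; it works on the premise that files are stored mostly contiguously on disk, without severe fragmentation.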

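2. Pipelined readahead

While a program is processing one batch of data, we would like the kernel to prepare the next batch in the background so that the CPU and the disk work as a pipeline. Linux tracks the readahead state of the current sequential stream with two windows: the current window and the ahead window. The ahead window exists for pipelining: while the application is working in the current window, the kernel may be doing asynchronous readahead in the ahead window. As soon as the program enters the ahead window, the kernel advances both windows and starts readahead I/O in the new ahead window.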

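3. Readahead size

Once sequential readahead has been decided on, a suitable readahead size must be chosen. If it is too small, the expected performance gain does not materialize; if it is too large, pages the program never needs may be loaded, wasting resources. Linux therefore uses a fast window-expansion process:

◆ First readahead: readahead_size = read_size * 2; // or *4

The initial readahead window is two to four times the read size. This means that using a larger read granularity in your program (e.g. 32KB) can slightly improve I/O efficiency.

◆ Subsequent readahead: readahead_size *= 2;

Subsequent readahead windows double each time until they reach the system's maximum readahead size, whose default is 128KB. That default has been in use for at least five years and is too conservative for today's faster disks and larger memories. The maximum can be raised with blockdev, which takes the value in 512-byte sectors (2048 sectors = 1MB):

# blockdev --setra 2048 /dev/sda

Of course, a larger readahead size is not always better; in many cases I/O latency has to be considered as well.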
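The sequentiality test and the window growth described above can be modeled in a few lines. This is a toy model of the rules as stated in the excerpt, not the kernel's readahead code; all function names are made up for illustration.

```python
def is_sequential(offset, prev_end=None):
    """Apply the two conditions: first read at offset 0, or contiguous with the last read."""
    if prev_end is None:          # first read after open
        return offset == 0
    return offset == prev_end     # continues where the previous read stopped


def readahead_windows(read_size, max_readahead=128 * 1024, initial_factor=2):
    """Return the successive window sizes: read_size * factor, then doubling up to the max."""
    sizes = []
    size = min(read_size * initial_factor, max_readahead)
    while True:
        sizes.append(size)
        if size >= max_readahead:
            break
        size = min(size * 2, max_readahead)
    return sizes


def setra_to_bytes(sectors, sector_size=512):
    """Convert a blockdev --setra value (512-byte sectors) to bytes."""
    return sectors * sector_size
```

With 4KB reads the window grows 8KB, 16KB, 32KB, 64KB, 128KB, so it takes five steps to reach the default maximum; with 32KB reads it takes two (64KB, 128KB), which is the small efficiency gain the excerpt mentions.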

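Other details:

1. pread and pwrite

In multithreaded I/O, use pread and pwrite wherever possible. With seek + write/read, each operation has to be protected by a lock, and that lock serializes the threads' operations on the same file at the application layer, cancelling the benefit of multithreading.

With pread, multiple threads are also much faster than a single thread, which shows that pread calls on the same file descriptor do not block one another. How do pread and pwrite manage this in their underlying implementation? The main reason multithreaded IOPS is higher than single-threaded IOPS is the I/O scheduling algorithm; the absence of serious contention between threads doing pread is a secondary factor.

When executing pread, the kernel does not take the inode's semaphore, so one thread reading a file does not block the others. pwrite does take the inode's semaphore, so multiple threads contend on it; but pwrite returns as soon as the data is in the cache, which takes very little time, so the contention is not severe.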
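To illustrate the point about avoiding seek + read under a lock, here is a minimal sketch in which several threads read different blocks through one shared file descriptor using os.pread. The block size, worker count, and function names are arbitrary choices for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # bytes per read; an arbitrary size for the example


def read_block(fd, index):
    # os.pread takes an explicit offset and leaves the shared file
    # position untouched, so no application-level lock is needed.
    return os.pread(fd, BLOCK, index * BLOCK)


def parallel_read(path, n_blocks, workers=4):
    """Read n_blocks blocks of path concurrently through a single descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() returns results in block order regardless of completion order
            return list(pool.map(lambda i: read_block(fd, i), range(n_blocks)))
    finally:
        os.close(fd)
```

Had the threads used os.lseek followed by os.read instead, two threads could interleave between the seek and the read and get each other's data, which is why that variant needs a lock.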

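2. Are multiple sets of file descriptors needed?

Given that pread/pwrite are used, can I/O performance be improved further by giving each reader or writer thread its own set of file descriptors?

Each file descriptor corresponds to a kernel object called file, and each file on disk corresponds to an object called inode. If a process opens the same file twice, it gets two file descriptors, which correspond to two file objects in the kernel but only one inode object. Reads and writes are ultimately carried out through the inode object. So if the threads open the same file, they all end up operating on the same inode object even when each has its own descriptor. I/O performance therefore does not improve.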
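The relationship between descriptors and inodes described above can be observed from user space: two opens of the same path yield different descriptors with independent file positions, yet fstat reports the same device and inode number for both. The helper name below is illustrative.

```python
import os


def same_inode(fd_a, fd_b):
    """True if both descriptors refer to the same inode on the same device."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def open_twice(path):
    """Open path twice and return the two independent read-only descriptors."""
    return os.open(path, os.O_RDONLY), os.open(path, os.O_RDONLY)
```

Seeking on one of the two descriptors does not move the other, because the position lives in the per-open file object, while the data and its locking live with the shared inode.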

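Over the past couple of days, while tuning database performance, I needed to reduce the influence of the operating system's file cache on the database, so I looked into ways of shrinking the file system cache. One of them is to change the two parameters /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio. I read quite a few blog posts about them but could never work out how the two differ, until the English blog post below made it roughly clear.

vm.dirty_background_ratio: this parameter sets the percentage of system memory (e.g. 5%) that dirty pages in the file system cache may reach before background writeback processes such as pdflush/flush/kdmflush are triggered to flush some of the dirty pages to storage asynchronously.

vm.dirty_ratio: this parameter sets the percentage of system memory (e.g. 10%) that dirty pages may reach before the system has no choice but to deal with them (there are by then so many dirty pages that some must be flushed to storage to avoid data loss). While this happens, many application processes may block because the system has switched to handling file I/O.

I used to believe, wrongly, that the dirty_ratio condition could never be reached, since vm.dirty_background_ratio is always reached first. It is true that vm.dirty_background_ratio is reached first and triggers the flush process to write back asynchronously, but applications can keep writing during that time. If the applications together write more than the flush process writes out, the amount of dirty data will naturally reach the limit set by vm.dirty_ratio. At that point the operating system switches to handling dirty pages synchronously and blocks the application processes.

The original post follows: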

Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio

by BOB PLANKERS on DECEMBER 22, 2013


This is post #16 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

In previous posts on vm.swappiness and using RAM disks we talked about how the memory on a Linux guest is used for the OS itself (the kernel, buffers, etc.), applications, and also for file cache. File caching is an important performance improvement, and read caching is a clear win in most cases, balanced against applications using the RAM directly. Write caching is trickier. The Linux kernel stages disk writes into cache, and over time asynchronously flushes them to disk. This has a nice effect of speeding disk I/O but it is risky. When data isn’t written to disk there is an increased chance of losing it.

There is also the chance that a lot of I/O will overwhelm the cache, too. Ever written a lot of data to disk all at once, and seen large pauses on the system while it tries to deal with all that data? Those pauses are a result of the cache deciding that there’s too much data to be written asynchronously (as a non-blocking background operation, letting the application process continue), and switches to writing synchronously (blocking and making the process wait until the I/O is committed to disk). Of course, a filesystem also has to preserve write order, so when it starts writing synchronously it first has to destage the cache. Hence the long pause.

The nice thing is that these are controllable options, and based on your workloads & data you can decide how you want to set them up. Let’s take a look:

$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000

vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that’s 3.2 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

vm.dirty_background_bytes and vm.dirty_bytes are another way to specify these parameters. If you set the _bytes version the _ratio version will become 0, and vice-versa.
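The arithmetic behind the 32 GB example, and the way a *_bytes setting replaces its *_ratio counterpart, can be written out as a short function. This is a simplified sketch: it uses total memory as the base, as the post does, and the function name is made up for illustration.

```python
def effective_threshold(total_bytes, ratio, bytes_value=0):
    """Threshold in bytes; a non-zero *_bytes setting takes the place of the ratio."""
    if bytes_value:
        return bytes_value
    return total_bytes * ratio // 100


GB = 10 ** 9  # decimal gigabytes, matching the 3.2 GB figure in the text
background = effective_threshold(32 * GB, ratio=10)  # 3.2 GB
hard_limit = effective_threshold(32 * GB, ratio=20)  # 6.4 GB
```

On a large machine even a 1% step is hundreds of megabytes, which is one reason the byte-based variants exist.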

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.
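Taken together, the expire and writeback intervals bound how long a dirty page can wait before periodic writeback picks it up. The sketch below makes the simplifying assumption that the flusher wakes at exact multiples of the writeback interval starting at time 0 and ignores how long the flush itself takes; the function name is illustrative.

```python
import math


def first_writeback_time(dirtied_at, expire_centisecs=3000, writeback_centisecs=500):
    """Time (seconds) of the first periodic wakeup at which the page counts as expired."""
    expire = expire_centisecs / 100
    interval = writeback_centisecs / 100
    eligible_at = dirtied_at + expire  # page is old enough from this moment on
    # First wakeup at or after the moment the page becomes eligible.
    return math.ceil(eligible_at / interval) * interval
```

With the defaults, a page dirtied at t = 0 is written at the 30 second wakeup, while one dirtied at t = 0.1 just misses it and waits until 35 seconds, so under these assumptions the wait is between 30 and roughly 35 seconds.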

You can also see statistics on the page cache in /proc/vmstat:

$ cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 878
nr_writeback 0
nr_writeback_temp 0

In my case I have 878 dirty pages waiting to be written to disk.
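The same counters can be collected from a script, which is convenient when sampling them over time. The following is a minimal sketch of a parser for the /proc/vmstat format shown above; the function name and the keys argument are illustrative.

```python
def parse_vmstat(text, keys=("dirty", "writeback")):
    """Return {counter: value} for vmstat lines whose name contains any of keys."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and any(key in parts[0] for key in keys):
            counters[parts[0]] = int(parts[1])
    return counters


if __name__ == "__main__":
    try:
        with open("/proc/vmstat") as f:
            for name, value in parse_vmstat(f.read()).items():
                print(name, value)
    except FileNotFoundError:
        pass  # not a Linux system
```

The values are page counts, so on a system with 4 KB pages the 878 dirty pages above amount to about 3.4 MB.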

Approach 1: Decreasing the Cache

As with most things in the computer world, how you adjust these depends on what you’re trying to do. In many cases we have fast disk subsystems with their own big, battery-backed NVRAM caches, so keeping things in the OS page cache is risky. Let’s try to send I/O to the array in a more timely fashion and reduce the chance our local OS will, to borrow a phrase from the service industry, be “in the weeds.” To do this we lower vm.dirty_background_ratio and vm.dirty_ratio by adding new numbers to /etc/sysctl.conf and reloading with “sysctl -p”:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

This is a typical approach on virtual machines, as well as Linux-based hypervisors. I wouldn’t suggest setting these parameters to zero, as some background I/O is nice to decouple application performance from short periods of higher latency on your disk array & SAN (“spikes”).

Approach 2: Increasing the Cache

There are scenarios where raising the cache dramatically has positive effects on performance. These situations are where the data contained on a Linux guest isn’t critical and can be lost, and usually where an application is writing to the same files repeatedly or in repeatable bursts. In theory, by allowing more dirty pages to exist in memory you’ll rewrite the same blocks over and over in cache, and just need to do one write every so often to the actual disk. To do this we raise the parameters:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Sometimes folks also increase the vm.dirty_expire_centisecs parameter to allow more time in cache. Beyond the increased risk of data loss, you also run the risk of long I/O pauses if that cache gets full and needs to destage, because on large VMs there will be a lot of data in cache.

Approach 3: Both Ways

There are also scenarios where a system has to deal with infrequent, bursty traffic to slow disk (batch jobs at the top of the hour, midnight, writing to an SD card on a Raspberry Pi, etc.). In that case an approach might be to allow all that write I/O to be deposited in the cache so that the background flush operations can deal with it asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

Here the background processes will start writing right away when it hits that 5% ceiling but the system won’t force synchronous I/O until it gets to 80% full. From there you just size your system RAM and vm.dirty_ratio to be able to consume all the written data. Again, there are tradeoffs with data consistency on disk, which translates into risk to data. Buy a UPS and make sure you can destage cache before the UPS runs out of power. :)
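The three approaches differ only in the pair of ratios, so they can be kept as named profiles and rendered into sysctl.conf lines. The sketch below does that and adds a sanity check that the background ratio stays below the hard limit; the profile names and the check are assumptions made for this example, not part of the original post.

```python
# (dirty_background_ratio, dirty_ratio) for the three approaches above.
PROFILES = {
    "decrease_cache": (5, 10),
    "increase_cache": (50, 80),
    "both_ways": (5, 80),
}


def sysctl_lines(profile):
    """Render the two sysctl.conf lines for a named profile."""
    background, dirty = PROFILES[profile]
    # Background writeback should start before writers are forced to block.
    if not 0 < background < dirty <= 100:
        raise ValueError(f"invalid ratios for {profile}: {background}, {dirty}")
    return [
        f"vm.dirty_background_ratio = {background}",
        f"vm.dirty_ratio = {dirty}",
    ]


if __name__ == "__main__":
    print("\n".join(sysctl_lines("both_ways")))
```

The output is meant to be appended to /etc/sysctl.conf and loaded with sysctl -p, as described under Approach 1.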

No matter the route you choose you should always be gathering hard data to support your changes and help you determine if you are improving things or making them worse. In this case you can get data from many different places, including the application itself, /proc/vmstat, /proc/meminfo, iostat, vmstat, and many of the things in /proc/sys/vm. Good luck!

0 0