Page Cache, the Affair Between Memory and Files


Previously we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often misunderstood relationship between files and memory and its consequences for performance.


Two serious problems must be solved by the OS when it comes to files. The first one is the mind-blowing slowness of hard drives, and disk seeks in particular, relative to memory. The second is the need to load file contents into physical memory once and share the contents among programs. If you use Process Explorer to poke at Windows processes, you'll see there are ~15MB worth of common DLLs loaded in every process. My Windows box right now is running 100 processes, so without sharing I'd be using up to ~1.5 GB of physical RAM just for common DLLs. No good. Likewise, nearly all Linux programs need ld.so and libc, plus other common libraries.


Happily, both problems can be dealt with in one shot: the page cache, where the kernel stores page-sized chunks of files. To illustrate the page cache, I'll conjure a Linux program named render, which opens file scene.dat and reads it 512 bytes at a time, storing the file contents into a heap-allocated block. The first read goes like this:
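For concreteness, here's a minimal sketch of what such a render program might look like. The file name comes from the article; the rendering work itself is elided and error handling abbreviated:

    /* Sketch of render: read scene.dat 512 bytes at a time into a
     * heap-allocated block, exactly the access pattern described above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scene.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *heap_buf = malloc(512);      /* destination block on the heap */
        ssize_t n;
        while ((n = read(fd, heap_buf, 512)) > 0) {
            /* ... do rendering work with the 512 bytes in heap_buf ... */
        }

        free(heap_buf);
        close(fd);
        return 0;
    }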



After 12KB have been read, render's heap and the relevant page frames look thus:




This looks innocent enough, but there's a lot going on. First, even though this program uses regular read calls, three 4KB page frames are now in the page cache storing part of scene.dat. People are sometimes surprised by this, but all regular file I/O happens through the page cache. In x86 Linux, the kernel thinks of a file as a sequence of 4KB chunks. If you read a single byte from a file, the whole 4KB chunk containing the byte you asked for is read from disk and placed into the page cache. This makes sense because sustained disk throughput is pretty good and programs normally read more than just a few bytes from a file region. The page cache knows the position of each 4KB chunk within the file, depicted above as #0, #1, etc. Windows uses 256KB views analogous to pages in the Linux page cache.
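You can see this behavior for yourself on Linux with mincore(2), which reports which pages of a mapping are resident in memory. A sketch, assuming scene.dat exists and is at least one page long: after reading a single byte, the whole containing 4KB chunk shows up as cached.

    /* Read one byte of scene.dat, then ask the kernel (via mincore)
     * whether the containing 4KB page is resident in the page cache. */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scene.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char byte;
        read(fd, &byte, 1);                 /* a single-byte read */

        /* Creating the mapping faults nothing in by itself; mincore
         * just reports page cache residency for the file's pages. */
        struct stat st;
        fstat(fd, &st);
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        unsigned char vec[(st.st_size + 4095) / 4096];
        mincore(map, st.st_size, vec);

        printf("chunk #0 in the page cache: %s\n", (vec[0] & 1) ? "yes" : "no");
        return 0;
    }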


Sadly, in a regular file read the kernel must copy the contents of the page cache into a user buffer, which not only takes CPU time and hurts the CPU caches, but also wastes physical memory with duplicate data. As per the diagram above, the scene.dat contents are stored twice, and each instance of the program would store the contents an additional time. We've mitigated the disk latency problem but failed miserably at everything else. Memory-mapped files are the way out of this madness:



When you use file mapping, the kernel maps your program's virtual pages directly onto the page cache. This can deliver a significant performance boost: Windows System Programming reports run time improvements of 30% and up relative to regular file reads, while similar figures are reported for Linux and Solaris in Advanced Programming in the UNIX Environment. You might also save large amounts of physical memory, depending on the nature of your application.


As always with performance, measurement is everything, but memory mapping earns its keep in a programmer's toolbox. The API is pretty nice too: it allows you to access a file as bytes in memory and does not require your soul and code readability in exchange for its benefits. Mind your address space and experiment with mmap in Unix-like systems, CreateFileMapping in Windows, or the many wrappers available in high-level languages. When you map a file its contents are not brought into memory all at once, but rather on demand via page faults. The fault handler maps your virtual pages onto the page cache after obtaining a page frame with the needed file contents. This involves disk I/O if the contents weren't cached to begin with.
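Here is what the render example might look like rewritten to use mmap instead of read, a sketch under the same assumptions as before (file name from the article, error handling trimmed):

    /* Map scene.dat and touch its bytes directly: the pages we see are
     * the page cache pages themselves, filled on demand by page faults
     * rather than up front. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scene.dat", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += data[i];            /* each new 4KB page faults in here */

        munmap(data, st.st_size);
        close(fd);
        printf("checksum: %ld\n", sum);
        return 0;
    }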


Now for a pop quiz. Imagine that the last instance of our render program exits. Would the pages storing scene.dat in the page cache be freed immediately? People often think so, but that would be a bad idea. When you think about it, it is very common for us to create a file in one program, exit, then use the file in a second program. The page cache must handle that case. When you think more about it, why should the kernel ever get rid of page cache contents? Remember that disk is 5 orders of magnitude slower than RAM, hence a page cache hit is a huge win. So long as there's enough free physical memory, the cache should be kept full. It is therefore not dependent on a particular process, but rather it's a system-wide resource. If you run render a week from now and scene.dat is still cached, bonus! This is why the kernel cache size climbs steadily until it hits a ceiling. It's not because the OS is garbage and hogs your RAM, it's actually good behavior because in a way free physical memory is a waste. Better use as much of the stuff for caching as possible.


Due to the page cache architecture, when a program calls write(), bytes are simply copied to the page cache and the page is marked dirty. Disk I/O normally does not happen immediately, thus your program doesn't block waiting for the disk. On the downside, if the computer crashes your writes will never make it, hence critical files like database transaction logs must be fsync()ed (though one must still worry about drive controller caches, oy!). Reads, on the other hand, normally block your program until the data is available. Kernels employ eager loading to mitigate this problem, an example of which is read-ahead, where the kernel preloads a few pages into the page cache in anticipation of your reads. You can help the kernel tune its eager loading behavior by providing hints on whether you plan to read a file sequentially or randomly (see madvise(), readahead(), Windows cache hints). Linux does read-ahead for memory-mapped files, but I'm not sure about Windows. Finally, it's possible to bypass the page cache using O_DIRECT in Linux or NO_BUFFERING in Windows, something database software often does.
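A sketch of the write side: write() returns as soon as the bytes are in the page cache, and fsync() is what actually blocks until they hit the disk. The file name and record contents here are made up, and the posix_fadvise call shows one of the hinting APIs mentioned above:

    /* Writes land in the page cache and return quickly; fsync() forces
     * the dirty pages out, which is what a transaction log needs before
     * acknowledging a commit. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *rec = "commit record\n";
        write(fd, rec, strlen(rec));  /* copied to the page cache, page marked dirty */
        fsync(fd);                    /* block until the kernel flushes it to the drive */
        close(fd);

        /* Separately, a reader can hint its access pattern so the kernel
         * tunes read-ahead (the madvise()/readahead() family from above): */
        int rfd = open("scene.dat", O_RDONLY);
        posix_fadvise(rfd, 0, 0, POSIX_FADV_SEQUENTIAL);
        close(rfd);
        return 0;
    }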


A file mapping may be private or shared. This refers only to updates made to the contents in memory: in a private mapping the updates are not committed to disk or made visible to other processes, whereas in a shared mapping they are. Kernels use the copy on write mechanism, enabled by page table entries, to implement private mappings. In the example below, both render and another program called render3d (am I creative or what?) have mapped scene.dat privately. Render then writes to its virtual memory area that maps the file:



The read-only page table entries shown above do not mean the mapping is read only, they're merely a kernel trick to share physical memory until the last possible moment. You can see how 'private' is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it's what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that's it. Updates are visible to other processes and end up in the disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
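To make the private/shared distinction concrete, here's a sketch that maps the same file both ways and writes through each mapping (file name from the article; assumes it is writable and at least one page long):

    /* A store through the MAP_PRIVATE view triggers copy-on-write, so the
     * process gets its own page and the file is untouched; a store through
     * the MAP_SHARED view dirties the page cache page itself and will
     * reach the disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scene.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        char *shrd = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,  fd, 0);

        priv[0] = 'X';  /* page fault -> copy on write: priv now has a private page */
        shrd[0] = 'Y';  /* dirties the page cache; visible to others, ends up on disk */

        printf("private view: %c, shared view: %c\n", priv[0], shrd[0]);
        munmap(priv, 4096);
        munmap(shrd, 4096);
        close(fd);
        return 0;
    }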


Dynamically loaded libraries are brought into your program's address space via file mapping. There's nothing magical about it, it's the same private file mapping available to you via regular APIs. Below is an example showing part of the address spaces from two running instances of the file-mapping render program, along with physical memory, to tie together many of the concepts we've seen.
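If you want to see these library mappings in a live process on Linux, /proc/self/maps lists every mapping along with the file backing it; a quick sketch that filters for shared objects:

    /* Dump this process's own memory map and keep the lines backed by
     * .so files: those are the dynamically loaded libraries, brought in
     * with the same private file mappings discussed above. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *maps = fopen("/proc/self/maps", "r");
        if (!maps) { perror("fopen"); return 1; }

        char line[512];
        while (fgets(line, sizeof line, maps))
            if (strstr(line, ".so"))
                fputs(line, stdout);

        fclose(maps);
        return 0;
    }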



