
KS2009: How Google uses Linux

By Jonathan Corbet
October 21, 2009
LWN's 2009 Kernel Summit coverage
There may be no single organization which runs more Linux systems than Google. But the kernel development community knows little about how Google uses Linux and what sort of problems are encountered there. Google's Mike Waychison traveled to Tokyo to help shed some light on this situation; the result was an interesting view on what it takes to run Linux in this extremely demanding setting.

Mike started the talk by giving the developers a good laugh: it seems that Google manages its kernel code with Perforce. He apologized for that. There is a single tree that all developers commit to. About every 17 months, Google rebases its work to a current mainline release; what follows is a long struggle to make everything work again. Once that's done, internal "feature" releases happen about every six months.

This way of doing things is far from ideal; it means that Google lags far behind the mainline and has a hard time talking with the kernel development community about its problems.

There are about 30 engineers working on Google's kernel. Currently they tend to check their changes into the tree, then forget about them for the next 18 months. This leads to some real maintenance issues; developers often have little idea of what's actually in Google's tree until it breaks.

And there's a lot in that tree. Google started with the 2.4.18 kernel - but they patched over 2000 files, inserting 492,000 lines of code. Among other things, they backported 64-bit support into that kernel. Eventually they moved to 2.6.11, primarily because they needed SATA support. A 2.6.18-based kernel followed, and they are now working on preparing a 2.6.26-based kernel for deployment in the near future. They are currently carrying 1208 patches to 2.6.26, inserting almost 300,000 lines of code. Roughly 25% of those patches, Mike estimates, are backports of newer features.

There are plans to change all of this; Google's kernel group is trying to get to a point where they can work better with the kernel community. They're moving to git for source code management, and developers will maintain their changes in their own trees. Those trees will be rebased to mainline kernel releases every quarter; that should, it is hoped, motivate developers to make their code more maintainable and more closely aligned with the upstream kernel.

Linus asked: why aren't these patches upstream? Is it because Google is embarrassed by them, or is it secret stuff that they don't want to disclose, or is it a matter of internal process problems? The answer was simply "yes." Some of this code is ugly stuff which has been carried forward from the 2.4.18 kernel. There are also doubts internally about how much of this stuff will be actually useful to the rest of the world. But, perhaps, about half of this code could be upstreamed eventually.

As much as 3/4 of Google's code consists of changes to the core kernel; device support is a relatively small part of the total.

Google has a number of "pain points" which make working with the community harder. Keeping up with the upstream kernel is hard - it simply moves too fast. There is also a real problem with developers posting a patch, then being asked to rework it in a way which turns it into a much larger project. Alan Cox had a simple response to that one: people will always ask for more, but sometimes the right thing to do is to simply tell them "no."

In the area of CPU scheduling, Google found the move to the completely fair scheduler to be painful. In fact, it was such a problem that they finally forward-ported the old O(1) scheduler and can run it in 2.6.26. Changes in the semantics of sched_yield() created grief, especially with the user-space locking that Google uses. High-priority threads can make a mess of load balancing, even if they run for very short periods of time. And load balancing matters: Google runs something like 5000 threads on systems with 16-32 cores.
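
To make the sched_yield() issue concrete, here is a minimal sketch - not Google's code - of the kind of user-space lock that depends on yield semantics: a contended thread gives up the CPU in the hope that the lock holder runs next, behavior that CFS's re-queueing changes could undermine. The thread and iteration counts are arbitrary; build with gcc -pthread.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* A user-space spinlock that yields while contended.  Code like
     * this relies on sched_yield() letting the lock holder run soon;
     * CFS changed how yielding tasks are re-queued, which is the kind
     * of semantic shift described above.  Illustrative only. */
    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static long counter;

    static void lock_yield(void)
    {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            sched_yield();  /* give up the CPU, hoping the holder runs */
    }

    static void unlock_yield(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            lock_yield();
            counter++;          /* critical section */
            unlock_yield();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* expect 400000 */
        return 0;
    }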

On the memory management side, newer kernels changed the management of dirty bits, leading to overly aggressive writeout. The system could easily get into a situation where lots of small I/O operations generated by kswapd would fill the request queues, starving other writeback; this particular problem should be fixed by the per-BDI writeback changes in 2.6.32.

As noted above, Google runs systems with lots of threads - not an uncommon mode of operation in general. One thing they found is that sending signals to a large thread group can lead to a lot of run queue lock contention. They also have trouble with contention for the mmap_sem semaphore; one sleeping reader can block a writer which, in turn, blocks other readers, bringing the whole thing to a halt. The kernel needs to be fixed to not wait for I/O with that semaphore held.

Google makes a lot of use of the out-of-memory (OOM) killer to pare back overloaded systems. That can create trouble, though, when processes holding mutexes encounter the OOM killer. Mike wonders why the kernel tries so hard, rather than just failing allocation requests when memory gets too tight.
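
For context, the mainline knob for steering the OOM killer in the 2.6.26 era was /proc/&lt;pid&gt;/oom_adj; a minimal sketch of using it follows. The -17 value (OOM_DISABLE) and the interface itself are per that era - later kernels replaced it with oom_score_adj.

    #include <stdio.h>

    /* Sketch: exempt the current process from OOM-killer selection via
     * the 2.6.26-era /proc interface.  -17 was OOM_DISABLE; positive
     * values make a process a more likely victim.  Later kernels
     * replaced oom_adj with oom_score_adj. */
    int main(void)
    {
        FILE *f = fopen("/proc/self/oom_adj", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "-17\n");
        fclose(f);
        return 0;
    }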

So what is Google doing with all that code in the kernel? They try very hard to get the most out of every machine they have, so they cram a lot of work onto each. This work is segmented into three classes: "latency sensitive," which gets short-term resource guarantees, "production batch," which has guarantees over longer periods, and "best effort," which gets no guarantees at all.

This separation of classes is done partly through the separation of each machine into a large number of fake "NUMA nodes." Specific jobs are then assigned to one or more of those nodes. One thing added by Google is "NUMA-aware VFS LRUs" - virtual memory management which focuses on specific NUMA nodes. Nick Piggin remarked that he has been working on something like that and would have liked to have seen Google's code.
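
Google's internal mechanism is not public, but a hedged sketch of the node-confinement idea using the stock libnuma API is shown below: with a boot parameter such as numa=fake=16, one machine presents many small nodes, and a job can be restricted to one of them. The node number and allocation size are arbitrary; link with -lnuma.

    #include <numa.h>
    #include <stdio.h>

    /* Sketch: confine a job to one (possibly fake) NUMA node with the
     * stock libnuma API.  Node 0 and the 1 MB allocation are
     * arbitrary choices for illustration. */
    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int node = 0;
        numa_run_on_node(node);              /* run only on this node's CPUs */
        void *buf = numa_alloc_onnode(1 << 20, node);  /* memory from it too */
        if (!buf)
            return 1;

        /* ... the job's work would run here, confined to the node ... */

        numa_free(buf, 1 << 20);
        return 0;
    }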

There is a special SCHED_GIDLE scheduling class which is a truly idle class; if there is no spare CPU available, jobs in that class will not run at all. To avoid priority inversion problems, SCHED_GIDLE processes have their priority temporarily increased whenever they sleep in the kernel (but not if they are preempted in user space). Networking is managed with the HTB queueing discipline, augmented with a bunch of bandwidth control logic. For disks, they are working on proportional I/O scheduling.
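
SCHED_GIDLE itself is Google-internal, but mainline has had a related (weaker) SCHED_IDLE class since 2.6.23; a minimal sketch of putting a task into it:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Sketch: put the calling task into mainline's SCHED_IDLE class;
     * such tasks run only when the CPU would otherwise be idle.
     * SCHED_GIDLE itself is not in mainline. */
    int main(void)
    {
        struct sched_param sp = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_IDLE, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }

        /* ... best-effort work would run here at idle priority ... */
        return 0;
    }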

Beyond that, a lot of Google's code is there for monitoring. They monitor all disk and network traffic, record it, and use it for analyzing their operations later on. Hooks have been added to let them associate all disk I/O back to applications - including asynchronous writeback I/O. Mike was asked if they could use tracepoints for this task; the answer was "yes," but, naturally enough, Google is using its own scheme now.
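
As a rough illustration of the tracepoint alternative, mainline's block tracepoints record which task issued each request. The sketch below enables block:block_rq_issue through the tracing filesystem and tails the trace pipe; the paths assume debugfs is mounted at /sys/kernel/debug, event names vary across kernel versions, and it must run as root.

    #include <stdio.h>

    #define ENABLE "/sys/kernel/debug/tracing/events/block/block_rq_issue/enable"
    #define PIPE   "/sys/kernel/debug/tracing/trace_pipe"

    /* Sketch: enable the block_rq_issue tracepoint and tail the trace
     * pipe; each emitted line names the task that issued the request. */
    int main(void)
    {
        FILE *en = fopen(ENABLE, "w");
        if (!en) { perror("enable"); return 1; }
        fputs("1\n", en);
        fclose(en);

        FILE *tp = fopen(PIPE, "r");
        if (!tp) { perror("trace_pipe"); return 1; }

        char line[512];
        while (fgets(line, sizeof(line), tp))
            fputs(line, stdout);

        fclose(tp);
        return 0;
    }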

Google has a lot of important goals for 2010; they include:

  • They are excited about CPU limits; these are intended to give priority access to latency-sensitive tasks while still keeping those tasks from taking over the system entirely.

  • RPC-aware CPU scheduling; this involves inspection of incoming RPC traffic to determine which process will wake up in response and how important that wakeup is.

  • A related initiative is delayed scheduling. For most threads, latency is not all that important. But the kernel tries to run them immediately when RPC messages come in; these messages tend not to be evenly distributed across CPUs, leading to serious load balancing problems. So threads can be tagged for delayed scheduling; when a wakeup arrives, they are not immediately put onto the run queue. Instead, they wait until the next global load balancing operation before becoming truly runnable.

  • Idle cycle injection: high-bandwidth power management so they can run their machines right on the edge of melting down - but not beyond.

  • Better memory controllers are on the list, including accounting for kernel memory use.

  • "Offline memory." Mike noted that it is increasingly hard to buy memory which actually works, especially if you want to go cheap. So they need to be able to set bad pages aside. TheHWPOISON work may help them in this area.

  • They need dynamic huge pages, which can be assembled and broken down on demand; for contrast, a sketch of mainline's static interface appears after this list.

  • On the networking side, there is a desire to improve support for receive-side scaling - directing incoming traffic to specific queues. They need to be able to account for software interrupt time and attribute it to specific tasks - network processing can often involve large amounts of softirq processing. They've been working on better congestion control; the algorithms they have come up with are "not Internet safe" but work well in the data center. And "TCP pacing" slows down outgoing traffic to avoid overloading switches.

  • For storage, there is a lot of interest in reducing block-layer overhead so it can keep up with high-speed flash. Using flash for disk acceleration in the block layer is on the list. They're looking at in-kernel flash translation layers, though it was suggested that it might be better to handle that logic directly in the filesystem.
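
Regarding the dynamic huge pages item above, the mainline interface of the time was static: pages are reserved into a hugetlb pool and mapped with MAP_HUGETLB (merged around 2.6.32). A minimal sketch of that static interface, for contrast with what Google asked for; the 2 MB size assumes x86 huge pages, and the pool must be populated first (e.g. via /proc/sys/vm/nr_hugepages).

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    /* Sketch of the static interface: map one huge page from the
     * reserved hugetlb pool.  This cannot assemble or break down huge
     * pages on demand, which is what Google wanted. */
    int main(void)
    {
        size_t len = 2UL << 20;   /* one 2 MB huge page on x86 */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");       /* fails if no huge pages are reserved */
            return 1;
        }
        munmap(p, len);
        return 0;
    }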

Mike concluded with a couple of "interesting problems." One of those is that Google would like a way to pin filesystem metadata in memory. The problem here is being able to bound the time required to service I/O requests. The time required to read a block from disk is known, but if the relevant metadata is not in memory, more than one disk I/O operation may be required. That slows things down in undesirable ways. Google is currently getting around this by reading file data directly from raw disk devices in user space, but they would like to stop doing that.
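
The workaround described above looks roughly like the following sketch: open the raw block device with O_DIRECT and read aligned blocks from user space, bypassing the page cache and all filesystem metadata. The device path and the 4096-byte alignment are assumptions; O_DIRECT requires aligned buffers, offsets, and lengths.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch: one aligned block read straight from a raw device,
     * with no filesystem metadata lookups on the I/O path. */
    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);  /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned offset and length */
        if (n < 0) perror("pread");

        free(buf);
        close(fd);
        return 0;
    }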

The other problem was lowering the system call overhead for providing caching advice (with fadvise()) to the kernel. It's not clear exactly what the problem was here.
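
For reference, the user-space entry point for this advice is posix_fadvise(); each hint is a separate system call, which is presumably where the overhead concern comes from. A minimal sketch with a hypothetical file path:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: advise the kernel about expected access patterns.
     * Every hint below is its own system call. */
    int main(void)
    {
        int fd = open("/var/tmp/datafile", O_RDONLY);  /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        /* Expect sequential reads of the whole file... */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* ...and, when done, drop the cached pages. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }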

All told, it was seen as one of the more successful sessions, with the kernel community learning a lot about one of its biggest customers. If Google's plans to become more community-oriented come to fruition, the result should be a better kernel for all.
