Receive packet steering


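I recently read "Linux Device Drivers", "Understanding the Linux Kernel" and "Understanding Linux Network Internals". None of them explains how network card interrupt processing is distributed across multiple CPUs, in the way the RSS feature does on Windows. I searched online and found the article below, which answers the question completely.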

Contemporary networking hardware can move a lot of packets, to the point that the host computer can have a hard time keeping up. In recent years, CPU speeds have stopped increasing, but the number of CPU cores is growing. The implication is clear: if the networking stack is to be able to keep up with the hardware, smarter processing (such as generic receive offload) will not be enough; the system must also be able to distribute the work across multiple processors. Tom Herbert's receive packet steering (RPS) patch aims to help make that happen.

From the operating system's point of view, distributing the work of outgoing data across CPUs is relatively straightforward.  The processes generating data will naturally spread out across the system, so the networking stack does not need to think much about it, especially now that multiple transmit queues are supported.  Incoming data is harder to distribute, though, because it is coming from a single source. Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines.  Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream.  Parallelizing such a stream requires some intelligence on the part of the host operating system.

Tom's patch provides that intelligence by hooking into the receive path (netif_rx() and netif_receive_skb()) right when the driver passes a packet into the networking subsystem. At that point, it creates a hash from the relevant protocol data (IP addresses and port numbers, in particular) and uses it to pick a CPU; the packet is then enqueued for the target CPU's attention. By default, any CPU on the system is fair game for network processing, but the list of target CPUs for any given interface can be configured explicitly by the administrator if need be.

The code is relatively simple, but it succeeds in distributing the load of receive processing across the system.  The use of the hash is important: it ensures that packets for the same stream of data end up on the same processor, increasing cache locality (and, thus, performance).  This scheme is also nice in that it requires no driver changes at all, so it can be deployed quickly and with minimal disruption.
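The steering step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the kernel code: the function names flow_hash() and pick_cpu() are hypothetical, CRC32 stands in for whatever hash the patch actually uses, and the allowed-CPU list models the administrator's configuration.

```python
import ipaddress
import zlib


def flow_hash(src_ip, dst_ip, src_port, dst_port):
    """Hash the fields that identify a flow.

    Assumption: CRC32 is a stand-in here; the real patch uses its own hash.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + src_port.to_bytes(2, "big")
        + dst_port.to_bytes(2, "big")
    )
    return zlib.crc32(key)


def pick_cpu(hash_value, allowed_cpus):
    """Map a hash onto one of the CPUs allowed for this interface."""
    return allowed_cpus[hash_value % len(allowed_cpus)]


# Hypothetical configuration: the administrator allows CPUs 0-3.
allowed = [0, 1, 2, 3]

# Two packets of the same flow hash identically, so they land on the same CPU.
first = pick_cpu(flow_hash("10.0.0.1", "10.0.0.2", 40000, 80), allowed)
again = pick_cpu(flow_hash("10.0.0.1", "10.0.0.2", 40000, 80), allowed)
```

Because the CPU is a pure function of the flow's addresses and ports, every packet of a given connection is handled by the same processor, which is where the cache-locality benefit comes from.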

There is one place where drivers can help, though.  The calculation of the hash requires accessing data from the packet header.  That access will necessarily involve one or more cache misses on the CPU running the steering code - that data was just put there by the network interface and thus cannot be in any CPU's cache.  Once the packet has been passed over to the CPU which will be doing the real work, that cache miss overhead is likely to be incurred again.  Unnecessary cache misses are the bane of high-speed network processing; quite a bit of work has been done to eliminate them wherever possible.  Adding a new cache miss for every packet in the steering code would be counterproductive.

It turns out that a number of network interfaces can, themselves, calculate a hash value for incoming packets. That processing comes for free, and it could eliminate the need to calculate that hash (and suffer the overhead of accessing the data) on the dispatching processor. To take advantage of this capability, the RPS patch adds a new rxhash field to the sk_buff (SKB) structure. Drivers which are able to obtain hash values from the hardware can place them in the SKB; the network stack will then skip the calculation of its own hash value. That should keep the packet's data out of the dispatching CPU's cache entirely, speeding processing.
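The rxhash shortcut can be sketched as follows. This is a simplified model, not the kernel's data structure: the Packet class and steering_hash() are hypothetical names, None is assumed to mean "the driver supplied no hash", and CRC32 again stands in for the software hash.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    header: bytes
    # Assumption: None means the driver did not supply a hardware hash.
    rxhash: Optional[int] = None


def steering_hash(pkt):
    """Return (hash, touched_header) for a packet.

    If the driver filled in rxhash, the header is never read, so the
    dispatching CPU avoids the cache miss on the packet data.
    """
    if pkt.rxhash is not None:
        return pkt.rxhash, False
    # Fallback: compute the hash in software, which reads the header.
    return zlib.crc32(pkt.header), True


hw_packet = Packet(header=b"\x45\x00", rxhash=0x1234)
sw_packet = Packet(header=b"\x45\x00")
```

The second element of the returned tuple exists only to make the point visible: with a hardware-supplied hash, the dispatching code has no reason to look at the packet contents at all.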

How well does this work? The patch included some benchmark results using the netperf tool. An 8-core server with a tg3-based network interface went from 90,000 transactions per second to 285,000; an e1000-based adapter on the same system went from 90,000 to 292,000. Similar results were obtained for nForce and bnx2x chipsets on 16-core servers. It would appear that this patch does succeed in making network processing faster on multi-core systems.
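To put those figures in perspective, the speedup factors can be worked out directly from the numbers quoted above. Only the tg3 and e1000 results are used, since the article gives no figures for the other chipsets.

```python
# Transactions per second (before RPS, after RPS), as quoted in the article.
results = {
    "tg3": (90_000, 285_000),
    "e1000": (90_000, 292_000),
}

speedups = {nic: after / before for nic, (before, after) in results.items()}

for nic, factor in speedups.items():
    print(f"{nic}: {factor:.2f}x")
# prints:
# tg3: 3.17x
# e1000: 3.24x
```

Both adapters gain a little more than a factor of three on the 8-core machine, so the improvement is large but still well short of linear scaling with the number of cores.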

The patch, incidentally, comes from Google, which has a bit of experience with network processing.  It has, evidently, been running on Google's production servers for a while.  So the RPS patch is, hopefully, an early component of what will be a broad stream of contributions from Google as that company tries to work more closely with the mainline.  It seems like a good start.

====================

Q: Some further discussion about MSI-X:

How is this related to MSI-X, the system whereby network cards can assert different MSI interrupts based on a checksum in the header? That allows the load to be spread across CPUs in much the same way as suggested above.

I'm also wondering how this interacts with PCAP. If you have a machine with a dozen processes attached to an interface, then the packet needs to be copied to several different places in userspace (assuming MMAP ring-buffers). These are all going to be running on different CPUs so I don't think the above processing will help. But perhaps the actual BPF filtering can be spread out over multiple CPUs?

I ran into a problem this week, where IO-APIC round-robin interrupt routing is disabled on machines with >= 8 CPUs, which means that if you don't have MSI-X you have >50% of one CPU dedicated to interrupt processing. The scheduler doesn't know this, leading to some odd effects. So if the above system works on ordinary MSI network cards, this could be a solution.

A: I think maybe you are referring to networking cards that provide multiple receive queues where each one can have a separate interrupt. In some sense, RPS is an emulation of this which is useful for "legacy" NICs that have only one queue. Even so, we found that certain combinations of HW multiqueue and RPS actually can have better performance than just using HW multiqueue alone.

With regard to PCAP, it's possible it may not help. However, the technique of moving packets between CPUs might be applicable at a higher layer.

This solution should definitely help if you don't have round-robin interrupts (it's actually better because of parallelism in the receive path).

 

From: http://lwn.net/Articles/362339/