A JIT for packet filters

By Jonathan Corbet
April 12, 2011
The Berkeley Packet Filter (BPF) is a mechanism for the fast filtering of network packets on their way to an application. It has its roots in BSD in the very early 1990's, a history that was not enough to prevent the SCO Group from claiming ownership of it. Happily, that claim proved to be as valid as the rest of SCO's assertions, so BPF remains a part of the Linux networking stack. A recent patch from Eric Dumazet may make BPF faster, at least on 64-bit x86 systems.

The purpose behind BPF is to let an application specify a filtering function to select only the network packets that it wants to see. An early BPF user was tcpdump, which used BPF to implement the filtering behind its complex command-line syntax. Other packet capture programs also make use of it. On Linux, there is another interesting application of BPF: the "socket filter" mechanism allows an application to filter incoming packets on any type of socket with BPF. In this mode, it can function as a sort of per-application firewall, eliminating packets before the application ever sees them.

The original BPF distribution came in the form of a user-space library, but the BPF interface quickly found its way into the kernel. When network traffic is high, there is a lot of value in filtering unwanted packets before they are copied into user space. Obviously, it is also important that BPF filters run quickly; every bit of per-packet overhead is going to hurt in a high-traffic situation. BPF was designed to allow a wide variety of filters while keeping speed in mind, but that does not mean that it cannot be made faster.

BPF defines a virtual machine which is almost Turing-machine-like in its simplicity. There are two registers: an accumulator and an index register. The machine also has a small scratch memory area, an implicit array containing the packet in question, and a small set of arithmetic, logical, and jump instructions. The accumulator is used for arithmetic operations, while the index register provides offsets into the packet or into the scratch memory areas. A very simple BPF program (taken from the 1993 USENIX paper [PDF]) might be:

        ldh [12]
        jeq #ETHERTYPE_IP, l1, l2
    l1: ret #TRUE
    l2: ret #0

The first instruction loads a 16-bit quantity from offset 12 in the packet to the accumulator; that value is the Ethernet protocol type field. It then compares the value to see if the packet is an IP packet or not; IP packets are accepted, while anything else is rejected. Naturally, filter programs get more complicated quickly. Header length can vary, so the program will have to calculate the offsets of (for example) TCP header values; that is where the index register comes into play. Scratch memory (which is the only place a BPF program can store to) is used when intermediate results must be kept.
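To make that a little more concrete, here is a rough user-space sketch of how the same filter could be encoded as classic BPF and attached with the "socket filter" interface mentioned above. struct sock_filter and SO_ATTACH_FILTER are the real interfaces; the rest of this little program is only illustrative, and creating the packet socket requires CAP_NET_RAW:

    #include <stdio.h>
    #include <arpa/inet.h>
    #include <linux/filter.h>
    #include <linux/if_ether.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* The sample filter above, encoded as classic BPF: accept the
         * frame if the EtherType at offset 12 is IPv4, otherwise drop it. */
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),           /* ldh [12] */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1), /* jeq #ETHERTYPE_IP */
            BPF_STMT(BPF_RET | BPF_K, 0xffff),                   /* ret #TRUE (accept) */
            BPF_STMT(BPF_RET | BPF_K, 0),                        /* ret #0 (drop) */
        };
        struct sock_fprog prog = {
            .len    = sizeof(insns) / sizeof(insns[0]),
            .filter = insns,
        };

        /* A packet socket sees the link-layer header, matching the offsets
         * used above; creating one requires CAP_NET_RAW. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
            perror("SO_ATTACH_FILTER");
            return 1;
        }
        /* recvfrom() on fd will now only ever return IPv4 frames. */
        return 0;
    }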

The Linux BPF implementation can be found in net/core/filter.c; it provides "standard" BPF along with a number of Linux-specific ancillary instructions which can test whether a packet is marked, which CPU the filter is running on, which interface the packet arrived on, and more. It is, at its core, a long switch statement designed to run the BPF instructions quickly. This code has seen a number of enhancements and speed improvements over the years, but there has not been any fundamental change for a long time.
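In rough outline, such a switch-based interpreter looks something like the following sketch. This is a drastically simplified stand-in for the real net/core/filter.c code, handling only the three opcodes used by the example above:

    #include <linux/filter.h>

    /* Simplified user-space sketch of a classic-BPF interpreter: one big
     * switch over the opcode, returning the number of bytes to accept
     * (0 means drop the packet).  Not the actual kernel code. */
    unsigned int run_filter(const struct sock_filter *insns,
                            const unsigned char *pkt, unsigned int len)
    {
        unsigned int A = 0, pc = 0;         /* accumulator, program counter */

        for (;;) {
            const struct sock_filter *f = &insns[pc++];

            switch (f->code) {
            case BPF_LD | BPF_H | BPF_ABS:  /* A = 16-bit word at offset k */
                if (f->k + 2 > len)
                    return 0;
                A = (pkt[f->k] << 8) | pkt[f->k + 1];
                break;
            case BPF_JMP | BPF_JEQ | BPF_K: /* conditional branch on A == k */
                pc += (A == f->k) ? f->jt : f->jf;
                break;
            case BPF_RET | BPF_K:           /* accept k bytes (0 = drop) */
                return f->k;
            /* ... the real interpreter handles the full instruction set,
             * including the index register and scratch memory ... */
            default:
                return 0;
            }
        }
    }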

Eric Dumazet's patch is a fundamental change: it puts a just-in-time compiler into the kernel to translate BPF code directly into the host system's assembly code. The simplicity of the BPF machine makes the JIT translation relatively simple; every BPF instruction maps to a straightforward x86 instruction sequence. There are a few assembly language helpers which help to implement the virtual machine's semantics; the accumulator and index are just stored in the processor's registers. The resulting program is placed in a bit of vmalloc() space and run directly when a packet is to be tested. A simple benchmark shows a 50ns savings for each invocation of a simple filter - that may seem small, but, when multiplied by the number of packets going through a system, that difference can add up quickly.
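The underlying mechanism - emit native instructions into a buffer of executable memory, then call the result through a function pointer - can be illustrated with a trivial user-space example. The snippet below just "compiles" a function that returns 42 on x86-64; it is only a sketch of the idea, not what the kernel patch does, which emits the filter's instruction sequences into kernel memory instead:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* mov eax, 42 ; ret  --  hand-assembled x86-64 machine code */
        unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

        /* Get a page that is both writable and executable, copy the
         * generated code in, and call it like an ordinary function. */
        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memcpy(buf, code, sizeof(code));
        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());    /* prints 42 */
        return 0;
    }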

The current implementation is limited to the x86-64 architecture; indeed, that architecture is wired deeply into the code, which is littered with hard-coded x86 instruction opcodes. Should anybody want to add a second architecture, they will be faced with the choice of simply replicating the whole thing (it is not huge) or trying to add a generalized opcode generator to the existing JIT code.

An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF? The answer may be "yes," but it might also make more sense to bring back the nftables idea, which is built on a BPF-like virtual machine of its own. Given that there has been some talk of using nftables in other contexts (internal packet classification for packet scheduling, for example), the value of a JIT-translated nftables could be even higher. Nftables is a job for another day, though; meanwhile, we have a proof of the concept for BPF that appears to get the job done nicely.



A JIT for packet filters

Posted Apr 14, 2011 6:57 UTC (Thu) by jengelh (subscriber, #33263) [Link]

> An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF?

Since xtables modules are already handcrafted for a specific task, any interpreter module for arbitrary expressions (such as xt_u32 and nft) has a tendency to naturally run slower. But, if BPF can be JITed, it would seem not impossible to extend xt_u32 as well.

A JIT for packet filters

Posted Apr 15, 2011 14:01 UTC (Fri) by Nelson (subscriber, #21712) [Link]

Someone beat me to the punch.. I have been toying with code that does this for a few months. Congrats, I hope the community accepts the patch.

I did some research on this for work about 2 years ago; with something like BPF there is a dramatic gain due to the nature of the interpreter. You can cut a lot of cruft out with JIT. As for generic firewall rules? It's not as dramatic, but you can get a pretty general across-the-board improvement, and on some architectures it's definitely in the interesting range (maybe 30%, maybe more, I guess it depends how far you go). If you simply recode firewall rules as binary, think about it this way: the firewall has a linked list of instructions; all the linked list code goes away (it's not much, but some memory reads, a few instructions here and there), and then as you execute the instructions the JIT ones basically turn memory reads into literals, so you can dump the loads and some other machinery. It's not warp speed but a nice bump, maybe in the 30%-ish range generally; it depends on the architecture and there are a lot of variables. Like I said, it's an interesting and noticeable improvement.

Now where it can be interesting is if you coded up a more advanced compiler to optimize the rules. (tcpdump does this; it's ugly, but check it out, look at the optimized output some time.) A typical stream of rules might have 10 rules that all apply to TCP packets and then check various IP ranges and ports. xtables currently would execute each "instruction" until it reached a result (is packet TCP? does packet src IP match range Y from rule.. okay go to the next one, is the packet TCP?... it would check TCP 10 times). A good compiler can invert that logic and figure out better ways to do it (is packet TCP? yes, then see if it's in the ranges of IPs from these 10 rules, and maybe those rules can be compressed into just checking a couple bits because all the IPs are similar; no, it's not TCP, then skip all ten rules and look at the next batch). I wouldn't hazard a guess as to how much faster this makes the firewall, but the potential is HUGE. So we could maybe replace iptables with some sort of LLVM-based compiler that generated a bytecode "program" that contained the whole firewall.

Whether it's worth the complexity, the difficulty in debugging and porting is a different question.

A JIT for packet filters

Posted Apr 16, 2011 2:55 UTC (Sat) by wahern (subscriber, #37304) [Link]

The problem with the benchmark they did was that it's not a fair appraisal of interpretation versus compiling. The BPF switch interpreter isn't threaded. That is, at the end of every instruction it jumps back to the while loop, which does a conditional branch. Then there's the switch, which may or may not do one or more conditional branches.

For fair comparison with a JIT compiler, the interpreter would instead jump directly from one instruction to the next using jump tables--indexing into a table of labels constructed using GCC's label address-of operator, &&.

On my own VM I can dramatically improve performance on many programs merely by threading the interpreter. If doing this gives the same performance, which it very well could given that BPF might be data bound and the ops are so simple, then it would be far preferable to adding hundreds of lines of new code for each architecture (or, conversely, having some architectures needlessly disadvantaged).
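For reference, a threaded dispatch loop of the kind described above can be sketched with GCC's labels-as-values extension; the two-opcode instruction set here is invented purely for illustration:

    #include <stdio.h>

    enum { OP_ADD, OP_RET };
    struct insn { int op; int k; };

    /* Each handler jumps straight to the next instruction's handler
     * instead of returning to a while/switch loop. */
    static int run(const struct insn *pc)
    {
        static void *dispatch[] = { [OP_ADD] = &&do_add, [OP_RET] = &&do_ret };
        int acc = 0;

        goto *dispatch[pc->op];

    do_add:
        acc += pc->k;
        pc++;
        goto *dispatch[pc->op];

    do_ret:
        return acc;
    }

    int main(void)
    {
        struct insn prog[] = { { OP_ADD, 40 }, { OP_ADD, 2 }, { OP_RET, 0 } };
        printf("%d\n", run(prog));    /* prints 42 */
        return 0;
    }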

A JIT for packet filters

Posted Apr 18, 2011 18:33 UTC (Mon) by Nelson (subscriber, #21712) [Link]

That's a fair criticism; you can make the BPF VM more efficient, but it's still a comparison of what's there to a JIT. Even with those improvements, you can get a fairly consistent boost with a JIT, just from turning the loads into literals. It might not be worth the complexity, but if there were a more generic JIT framework such that the platform support was there, it would be an interesting optimization if you rely upon BPF stuff a lot.

A JIT for packet filters

Posted Apr 14, 2011 10:24 UTC (Thu) by Cyberax (subscriber, #52523) [Link]

The obvious solution, of course, is to import LLVM into the kernel. Right?

A JIT for packet filters

Posted Apr 14, 2011 17:48 UTC (Thu) by fuhchee (subscriber, #40059) [Link]

One hopes that security/assurance concerns with a little JIT would be more manageable than with LLVM. Though you never know whether someone else's enthusiasm for importing <whatever> into the kernel might overrule such concerns.

A JIT for packet filters

Posted Apr 14, 2011 15:13 UTC (Thu) by trasz (subscriber, #45786) [Link]

Might be worth mentioning that FreeBSD has been using this approach for years.

A JIT for packet filters

Posted Apr 14, 2011 16:28 UTC (Thu) by wahern (subscriber, #37304) [Link]

Links? I can't find any JIT compiler in the FreeBSD kernel, at least skimming code in sys/net/bpf.c or bpf_filter.c.

I'd be interested in this because I'm looking for a tiny JIT compiler. MyJIT is the best I can find so far, but it can't recover from OOM errors and it's quite large, which means I'm too lazy to fix it.

A JIT for packet filters

Posted Apr 14, 2011 16:37 UTC (Thu) by trasz (subscriber, #45786) [Link]

See http://fxr.watson.org/fxr/source/net/bpf_jitter.c and http://fxr.watson.org/fxr/source/amd64/amd64/bpf_jit_mach....

It was introduced six years ago, with this commit:

r153151 | jkim | 2005-12-06 03:58:12 +0100 (Tue, 06 Dec 2005) | 17 lines

Add experimental BPF Just-In-Time compiler for amd64 and i386.

Use the following kernel configuration option to enable:

options BPF_JITTER

If you want to use bpf_filter() instead (e. g., debugging), do:

sysctl net.bpf.jitter.enable=0

to turn it off.

Currently BIOCSETWF and bpf_mtap2() are unsupported, and bpf_mtap() is
partially supported because 1) no need, 2) avoid expensive m_copydata(9).

Obtained from: WinPcap 3.1 (for i386)

A JIT for packet filters

Posted Apr 14, 2011 16:39 UTC (Thu) by wahern (subscriber, #37304) [Link]

Nevermind

A JIT for packet filters

Posted Apr 15, 2011 9:56 UTC (Fri) by rilder (subscriber, #59804) [Link]

Since generation of the JIT code is a one-time task, can't a usermode helper in the form of something like LLVM be used by the kernel then, or at least be made pluggable? It will also support multiple architectures and may generate more optimized code.

A JIT for packet filters

Posted Apr 17, 2011 1:40 UTC (Sun) by jzbiciak (✭ supporter ✭, #5246) [Link]

I can understand some folks being, erm, jittery about being able to load arbitrary code into the kernel. Then again, we have loadable kernel modules, so why not?

I agree that doing this in userspace seems to make much more sense than doing it in the kernel if optimized performance is your main care-about, since you can bring more resources to bear on the problem without bloating the kernel. It then comes down to managing the potential security issues, and trusting the correctness of the translator since you lose any sandboxing the interpreter might have offered.

(Yes, the translator can insert the required bounds checks, but nothing requires it to if you're loading raw machine code into the kernel.)

A JIT for packet filters

Posted Apr 17, 2011 21:08 UTC (Sun) by rilder (subscriber, #59804) [Link]

Good points.
My thought process for this was influenced by:
1. Text processing algorithms which use request_module() to load at runtime for algorithms which are not in the kernel. Again, if proper care is exercised here -- not loading outside the modprobe path, etc. -- it should be fine.
2. Coming back to usermode helpers, we already allow modules to be modprobed through external helpers, so a similar approach can be used. If someone can write to a sysctl/procfs maliciously, then the system is already compromised. I was thinking of reading from a pipe using a usermode helper, similar to how the core dumping function uses it to write instead.

A JIT for packet filters

Posted Apr 26, 2011 2:16 UTC (Tue) by welinder (guest, #4699) [Link]

People at CMU played with that years and years ago. The programs
from user space would only be accepted if they came with a proof
of correctness (which is easy to verify). The buzz words were
"proof-carrying code", I think.

A JIT for packet filters

Posted May 21, 2011 11:56 UTC (Sat) by snemarch (guest, #75085) [Link]

Have the usermode LLVM compiler generate pf bytecode, and have a small JIT'er in the kernel? Sounds like a reasonably safe scheme to me.

why does this need a JIT compiler?

Posted Apr 28, 2011 6:07 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

my understanding of the definition of a JIT compiler is that the system starts to run the interpreted code and compiles it as it goes, then uses the compiled version in the future.

filters change very infrequently, so why do you need a JIT compiler instead of a normal compiler?

am I missing something on the definition here? or are they misusing the term JIT? or are they using a JIT setup when they could just as easily use a normal compiler?

why does this need a JIT compiler?

Posted May 21, 2011 12:00 UTC (Sat) by snemarch (guest, #75085) [Link]

JIT isn't a super precisely defined term.

In this context, the "just in time" means the code is not compiled into the kernel (or as an LKM), but generated from user data. You don't need the Java/.NET style "interpret until determined hotspot, then generate native" behavior in order to call something JIT :-)



The original article: https://lwn.net/Articles/437981/