Berkeley Packet Filter


Original link: http://howtounix.info/man/FreeBSD/man4/bpf.4

NAME

bpf — Berkeley Packet Filter

SYNOPSIS

device bpf

DESCRIPTION

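The Berkeley Packet Filter provides a raw interface to data link layers in a protocol independent fashion. All packets on the network, even those destined for other hosts, are accessible through this mechanism.

The packet filter appears as a character special device, /dev/bpf. After opening the device, the file descriptor must be bound to a specific network interface with the BIOCSETIF ioctl. A given interface can be shared by multiple listeners, and the filter underlying each descriptor will see an identical packet stream.

A separate device file is required for each minor device. If a file is in use, the open will fail and errno will be set to EBUSY.

Associated with each open instance of a bpf file is a user-settable packet filter. Whenever a packet is received by an interface, all file descriptors listening on that interface apply their filter. Each descriptor that accepts the packet receives its own copy.

The packet filter will support any link level protocol that has fixed length headers. Currently, only Ethernet, SLIP, and PPP are supported.

Since packet data is in network byte order, applications should use the byteorder(3) macros to extract multi-byte values.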
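
The byte-order point can be illustrated with a small helper. This is only a sketch: the function name frame_ethertype is hypothetical, and it assumes an Ethernet frame whose 16-bit type field sits at byte offset 12 (the same offset the filter examples at the end of this page load from).

```c
#include <arpa/inet.h>	/* ntohs(), one of the byteorder(3) routines */
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return the 16-bit EtherType of an Ethernet frame
 * in host byte order. The field is stored in network (big-endian) order.
 */
uint16_t
frame_ethertype(const unsigned char *frame)
{
	uint16_t raw;

	/* memcpy avoids an unaligned 16-bit load on strict architectures. */
	memcpy(&raw, frame + 12, sizeof(raw));
	return (ntohs(raw));
}
```

Comparing the returned value against a host-order constant then behaves the same on little-endian and big-endian machines.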

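A packet can be sent out on the network by writing to a bpf file descriptor. The writes are unbuffered, meaning only one packet can be processed per write. Currently, only writes to Ethernets and SLIP links are supported.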

BUFFER MODES

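bpf devices deliver packet data to the application via memory buffers provided by the application. The buffer mode is set using the BIOCSETBUFMODE ioctl, and read using the BIOCGETBUFMODE ioctl.

Buffered read mode

By default, bpf devices operate in the BPF_BUFMODE_BUFFER mode, in which packet data is copied explicitly from kernel to user memory using the read(2) system call. The user process declares a fixed buffer size that is used both for sizing internal buffers and for all read(2) operations on the file. This size is queried using the BIOCGBLEN ioctl, and is set using the BIOCSBLEN ioctl. Note that an individual packet larger than the buffer size is necessarily truncated.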

Zero-copy buffer mode

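bpf devices may also operate in the BPF_BUFMODE_ZBUF mode, in which packet data is written directly into two user memory buffers by the kernel, avoiding both system call and copying overhead. Buffers are of fixed (and equal) size, page-aligned, and an integer multiple of the page size. The maximum zero-copy buffer size is returned by the BIOCGETZMAX ioctl. Note that an individual packet larger than the buffer size is necessarily truncated.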
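
The size and alignment constraints can be checked before the buffers are handed to the kernel. The following is a minimal sketch, not part of the bpf interface: zbuf_pair_ok is a hypothetical name, and the page size and maximum size are passed in as plain parameters (in a real program they would presumably come from getpagesize(3) and BIOCGETZMAX).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical pre-flight check for a zero-copy buffer pair.
 * Both buffers share one length, so "equal size" holds by construction.
 * Returns 1 if the pair satisfies the documented constraints, else 0.
 */
int
zbuf_pair_ok(const void *bufa, const void *bufb, size_t buflen,
    size_t page_size, size_t zmax)
{
	if (page_size == 0 || buflen == 0 || buflen > zmax)
		return (0);
	if (buflen % page_size != 0)		/* whole number of pages */
		return (0);
	if ((uintptr_t)bufa % page_size != 0 ||	/* page-aligned starts */
	    (uintptr_t)bufb % page_size != 0)
		return (0);
	return (1);
}
```

Failing early in userspace gives a clearer diagnostic than a rejected ioctl, though the kernel remains the final authority on what it accepts.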

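The user process registers two memory buffers using the BIOCSETZBUF ioctl, which accepts a struct bpf_zbuf pointer as an argument:

    struct bpf_zbuf {
            void    *bz_bufa;
            void    *bz_bufb;
            size_t   bz_buflen;
    };

bz_bufa is the userspace address of the first buffer that will be filled, and bz_bufb is the address of the second. bpf then cycles between the two buffers as they fill and are acknowledged.

Each buffer begins with a fixed-length header that holds synchronization and data length information for the buffer:

    struct bpf_zbuf_header {
            volatile u_int  bzh_kernel_gen; /* Kernel generation number. */
            volatile u_int  bzh_kernel_len; /* Length of data in the buffer. */
            volatile u_int  bzh_user_gen;   /* User generation number. */
            /* ...padding for future use... */
    };

The header structure of each buffer, including all padding, should be zeroed before the buffer is configured using BIOCSETZBUF. The remaining space in the buffer is used by the kernel to store packet data, laid out in the same format as in buffered read mode.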

The kernel and the user process follow a simple acknowledgement protocol via the buffer header to synchronize access to the buffer: when the header generation numbers, bzh_kernel_gen and bzh_user_gen, hold the same value, the kernel owns the buffer; when they differ, userspace owns the buffer.

While the kernel owns the buffer, its contents are unstable and may change asynchronously; while the user process owns the buffer, its contents are stable and will not change until the buffer has been acknowledged.

Initializing the buffer headers to all zeros before registering the buffers has the effect of assigning initial ownership of both buffers to the kernel. The kernel signals that a buffer has been assigned to userspace by modifying bzh_kernel_gen, and userspace acknowledges the buffer and returns it to the kernel by setting bzh_user_gen to the value of bzh_kernel_gen.

To avoid caching and memory re-ordering effects, the user process must use atomic operations and memory barriers when checking for and acknowledging buffers:

    #include <machine/atomic.h>

    /*
     * Return ownership of a buffer to the kernel for reuse.
     */
    static void
    buffer_acknowledge(struct bpf_zbuf_header *bzh)
    {

            atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
    }

    /*
     * Check whether a buffer has been assigned to userspace by the kernel.
     * Return true if userspace owns the buffer, and false otherwise.
     */
    static int
    buffer_check(struct bpf_zbuf_header *bzh)
    {

            return (bzh->bzh_user_gen !=
                atomic_load_acq_int(&bzh->bzh_kernel_gen));
    }
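
For readers without <machine/atomic.h>, the same hand-off can be sketched with C11 <stdatomic.h>. Everything here is a stand-in for illustration: struct demo_zbuf_header is a local type rather than the real header, and demo_kernel_handoff only simulates the kernel side so that the protocol can be exercised in a single process.

```c
#include <stdatomic.h>

/* Local stand-in for struct bpf_zbuf_header; field roles mirror it. */
struct demo_zbuf_header {
	atomic_uint kernel_gen;	/* bumped by the "kernel" on hand-off */
	atomic_uint kernel_len;	/* bytes of data in the buffer */
	atomic_uint user_gen;	/* set by userspace to acknowledge */
};

/* Userspace owns the buffer while the generation numbers differ. */
int
demo_buffer_check(struct demo_zbuf_header *h)
{
	return (atomic_load_explicit(&h->user_gen, memory_order_relaxed) !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/* Return the buffer: publish user_gen = kernel_gen with release order. */
void
demo_buffer_acknowledge(struct demo_zbuf_header *h)
{
	unsigned int gen;

	gen = atomic_load_explicit(&h->kernel_gen, memory_order_relaxed);
	atomic_store_explicit(&h->user_gen, gen, memory_order_release);
}

/* Simulates the kernel side for the demo only: fill, then hand off. */
void
demo_kernel_handoff(struct demo_zbuf_header *h, unsigned int len)
{
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}
```

The acquire load in the check pairs with the release in the hand-off, so once the check reports ownership the length and packet data written beforehand are visible. The release store in the acknowledge step keeps reads of the buffer from being reordered after the buffer is given back.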

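If any data is pending, the user process may force the assignment of the next buffer to userspace using the BIOCROTZBUF ioctl. This allows the user process to retrieve data from a partially filled buffer before the buffer is full, for example after a timeout. The process must then recheck buffer ownership using the header generation numbers, because the buffer will not be assigned to userspace if no data was present.

As in buffered read mode, kqueue(2), poll(2), and select(2) may be used to sleep awaiting the availability of a completed buffer. They report the file descriptor as readable when ownership of the next buffer is assigned to userspace.

In the current implementation, the kernel may assign zero, one, or both buffers to the user process; an earlier implementation, however, maintained the invariant that at most one buffer could be assigned to the user process at a time. To ensure both progress and high performance, user processes should acknowledge a completely processed buffer as quickly as possible, returning it for reuse, and should not block waiting on a second buffer while holding another buffer.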

IOCTLS

The ioctl(2) command codes below are defined in <net/bpf.h>. All commands require these includes:

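    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/ioctl.h>
    #include <net/bpf.h>

Additionally, BIOCGETIF and BIOCSETIF require <sys/socket.h> and <net/if.h>.

In addition to FIONREAD and SIOCGIFADDR, the following commands may be applied to any open bpf file. The third argument to ioctl(2) should be a pointer to the type indicated.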

BIOCGBLEN (u_int) Returns the required buffer length for reads on bpf files.

BIOCSBLEN (u_int) Sets the buffer length for reads on bpf files. The buffer must be set before the file is attached to an interface with BIOCSETIF. If the requested buffer size cannot be accommodated, the closest allowable size will be set and returned in the argument. A read call will result in EIO if it is passed a buffer that is not this size.

BIOCGDLT (u_int) Returns the type of the data link layer underlying the attached interface. EINVAL is returned if no interface has been specified. The device types, prefixed with "DLT_", are defined in <net/bpf.h>.

BIOCGETIF (struct ifreq) Returns the name of the hardware interface that the file is listening on. The name is returned in the ifr_name field of the ifreq structure. All other fields are undefined.

BIOCSETIF (struct ifreq) Sets the hardware interface associated with the file. This command must be performed before any packets can be read. The device is indicated by name using the ifr_name field of the ifreq structure. Additionally, performs the actions of BIOCFLUSH.

BIOCSRTIMEOUT, BIOCGRTIMEOUT (struct timeval) Set or get the read timeout parameter. The argument specifies the length of time to wait before timing out on a read request. This parameter is initialized to zero by open(2), indicating no timeout.

BIOCGSTATS (struct bpf_stat) Returns the following structure of packet statistics:

    struct bpf_stat {
            u_int bs_recv;    /* number of packets received */
            u_int bs_drop;    /* number of packets dropped */
    };

The fields are:

bs_recv: the number of packets received by the descriptor since opened or reset (including any buffered since the last read call); and
bs_drop: the number of packets which were accepted by the filter but dropped by the kernel because of buffer overflows (i.e., the application's reads are not keeping up with the packet traffic).

BIOCIMMEDIATE (u_int) Enable or disable "immediate mode", based on the truth value of the argument. When immediate mode is enabled, reads return immediately upon packet reception. Otherwise, a read will block until either the kernel buffer becomes full or a timeout occurs. This is useful for programs like rarpd(8) which must respond to messages in real time. The default for a new file is off.

BIOCSETF, BIOCSETFNR (struct bpf_program) Sets the read filter program used by the kernel to discard uninteresting packets. An array of instructions and its length is passed in using the following structure:

    struct bpf_program {
            int bf_len;
            struct bpf_insn *bf_insns;
    };

The filter program is pointed to by the bf_insns field, while its length in units of 'struct bpf_insn' is given by the bf_len field. See section FILTER MACHINE for an explanation of the filter language. The only difference between BIOCSETF and BIOCSETFNR is that BIOCSETF performs the actions of BIOCFLUSH while BIOCSETFNR does not.

BIOCSETWF (struct bpf_program) Sets the write filter program used by the kernel to control what type of packets can be written to the interface. See the BIOCSETF command for more information on the bpf filter program.

BIOCVERSION (struct bpf_version) Returns the major and minor version numbers of the filter language currently recognized by the kernel. Before installing a filter, applications must check that the current version is compatible with the running kernel. Version numbers are compatible if the major numbers match and the application minor is less than or equal to the kernel minor. The kernel version number is returned in the following structure:

    struct bpf_version {
            u_short bv_major;
            u_short bv_minor;
    };

The current version numbers are given by BPF_MAJOR_VERSION and BPF_MINOR_VERSION from <net/bpf.h>. An incompatible filter may result in undefined behavior (most likely, an error returned by ioctl() or haphazard packet matching).
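
The compatibility rule stated above is small enough to write down directly. A sketch with a hypothetical helper name follows; in a real program the application values would presumably be BPF_MAJOR_VERSION and BPF_MINOR_VERSION and the kernel values would come from the structure returned by the version ioctl.

```c
/*
 * Hypothetical helper implementing the compatibility rule: the major
 * numbers must match and the application's minor must not exceed the
 * kernel's minor. Returns 1 when the filter may be installed.
 */
int
filter_version_ok(unsigned short app_major, unsigned short app_minor,
    unsigned short kern_major, unsigned short kern_minor)
{
	return (app_major == kern_major && app_minor <= kern_minor);
}
```

An application built against version 1.0 is therefore accepted by a 1.1 kernel, while one built against 1.2 or 2.0 is not.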

BIOCSHDRCMPLT, BIOCGHDRCMPLT (u_int) Set or get the status of the "header complete" flag. Set to zero if the link level source address should be filled in automatically by the interface output routine. Set to one if the link level source address will be written, as provided, to the wire. This flag is initialized to zero by default.

BIOCSSEESENT, BIOCGSEESENT (u_int) These commands are obsolete but left for compatibility. Use BIOCSDIRECTION and BIOCGDIRECTION instead. Set or get the flag determining whether locally generated packets on the interface should be returned by BPF. Set to zero to see only incoming packets on the interface. Set to one to see packets originating locally and remotely on the interface. This flag is initialized to one by default.

BIOCSDIRECTION, BIOCGDIRECTION (u_int) Set or get the setting determining whether incoming, outgoing, or all packets on the interface should be returned by BPF. Set to BPF_D_IN to see only incoming packets on the interface. Set to BPF_D_INOUT to see packets originating locally and remotely on the interface. Set to BPF_D_OUT to see only outgoing packets on the interface. This setting is initialized to BPF_D_INOUT by default.

BIOCSTSTAMP, BIOCGTSTAMP (u_int) Set or get the format and resolution of the time stamps returned by BPF. Set to BPF_T_MICROTIME, BPF_T_MICROTIME_FAST, BPF_T_MICROTIME_MONOTONIC, or BPF_T_MICROTIME_MONOTONIC_FAST to get time stamps in 64-bit struct timeval format. Set to BPF_T_NANOTIME, BPF_T_NANOTIME_FAST, BPF_T_NANOTIME_MONOTONIC, or BPF_T_NANOTIME_MONOTONIC_FAST to get time stamps in 64-bit struct timespec format. Set to BPF_T_BINTIME, BPF_T_BINTIME_FAST, BPF_T_BINTIME_MONOTONIC, or BPF_T_BINTIME_MONOTONIC_FAST to get time stamps in 64-bit struct bintime format. Set to BPF_T_NONE to ignore the time stamp. All 64-bit time stamp formats are wrapped in struct bpf_ts. The BPF_T_MICROTIME_FAST, BPF_T_NANOTIME_FAST, BPF_T_BINTIME_FAST, BPF_T_MICROTIME_MONOTONIC_FAST, BPF_T_NANOTIME_MONOTONIC_FAST, and BPF_T_BINTIME_MONOTONIC_FAST formats are analogs of the corresponding formats without the _FAST suffix but do not perform a full time counter query, so their accuracy is one timer tick. The BPF_T_MICROTIME_MONOTONIC, BPF_T_NANOTIME_MONOTONIC, BPF_T_BINTIME_MONOTONIC, BPF_T_MICROTIME_MONOTONIC_FAST, BPF_T_NANOTIME_MONOTONIC_FAST, and BPF_T_BINTIME_MONOTONIC_FAST formats store the time elapsed since kernel boot. This setting is initialized to BPF_T_MICROTIME by default.

BIOCFEEDBACK (u_int) Set packet feedback mode. This allows injected packets to be fed back as input to the interface when output via the interface is successful. When the BPF_D_INOUT direction is set, an injected outgoing packet is not returned by BPF, to avoid duplication. This flag is initialized to zero by default.

BIOCGETBUFMODE, BIOCSETBUFMODE (u_int) Get or set the current bpf buffering mode; possible values are BPF_BUFMODE_BUFFER, buffered read mode, and BPF_BUFMODE_ZBUF, zero-copy buffer mode.

BIOCSETZBUF (struct bpf_zbuf) Set the current zero-copy buffer locations; buffer locations may be set only once zero-copy buffer mode has been selected, and prior to attaching to an interface. Buffers must be of identical size, page-aligned, and an integer multiple of pages in size. The three fields bz_bufa, bz_bufb, and bz_buflen must be filled out. If buffers have already been set for this device, the ioctl will fail.

BIOCGETZMAX (size_t) Get the largest individual zero-copy buffer size allowed. As two buffers are used in zero-copy buffer mode, the limit (in practice) is twice the returned size. As zero-copy buffers consume kernel address space, conservative selection of buffer size is suggested, especially when there are multiple bpf descriptors in use on 32-bit systems.

One of the following structures is prepended to each packet returned by read(2) or delivered via a zero-copy buffer:

    struct bpf_xhdr {
            struct bpf_ts   bh_tstamp;     /* time stamp */
            uint32_t        bh_caplen;     /* length of captured portion */
            uint32_t        bh_datalen;    /* original length of packet */
            u_short         bh_hdrlen;     /* length of bpf header (this struct
                                              plus alignment padding) */
    };

    struct bpf_hdr {
            struct timeval  bh_tstamp;     /* time stamp */
            uint32_t        bh_caplen;     /* length of captured portion */
            uint32_t        bh_datalen;    /* original length of packet */
            u_short         bh_hdrlen;     /* length of bpf header (this struct
                                              plus alignment padding) */
    };

The fields, whose values are stored in host order, are:

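bh_tstamp: the time stamp of the packet.

bh_caplen: the length of the captured portion of the packet.

bh_datalen: the original length of the packet.

bh_hdrlen: the length of the bpf header, that is, the structure itself plus any alignment padding.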

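The bh_hdrlen field exists to account for padding between the header and the link level protocol. The purpose is to guarantee proper alignment of the packet data structures, which is required on alignment sensitive architectures and improves performance on many other architectures. The packet filter ensures that the bpf_xhdr, bpf_hdr, and network layer header are word aligned. Currently, bpf_hdr is used only when the time stamp is set to BPF_T_MICROTIME, BPF_T_MICROTIME_FAST, BPF_T_MICROTIME_MONOTONIC, BPF_T_MICROTIME_MONOTONIC_FAST, or BPF_T_NONE, for backward compatibility reasons. Otherwise, bpf_xhdr is used. However, bpf_hdr may be deprecated in the near future. Suitable precautions must be taken when accessing the link layer protocol fields on alignment restricted machines. (This is not a problem on an Ethernet, since the type field is a short falling on an even offset, and the addresses are probably accessed in a bytewise fashion.)

Additionally, individual packets are padded so that each starts on a word boundary. This requires that an application has some knowledge of how to get from packet to packet. The macro BPF_WORDALIGN is defined in <net/bpf.h> to facilitate this process. It rounds up its argument to the nearest word aligned value (where a word is BPF_ALIGNMENT bytes wide).

For example, if 'p' points to the start of a packet, this expression will advance it to the next packet: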

p = (char *)p + BPF_WORDALIGN(p->bh_hdrlen + p->bh_caplen)

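For the alignment mechanisms to work properly, the buffer passed to read(2) must itself be word aligned. The malloc(3) function will always return an aligned buffer.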
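
Walking a whole buffer of packets with that expression might look like the sketch below. It is deliberately independent of <net/bpf.h>: DEMO_ALIGNMENT, DEMO_WORDALIGN and struct demo_hdr are local stand-ins (the stand-in header omits the time stamp), and the word size of sizeof(long) is an assumption made for the demo.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins; the real macros live in <net/bpf.h>. */
#define DEMO_ALIGNMENT	sizeof(long)
#define DEMO_WORDALIGN(x) \
	(((x) + (DEMO_ALIGNMENT - 1)) & ~(DEMO_ALIGNMENT - 1))

/* Simplified record header carrying only the fields the walk needs. */
struct demo_hdr {
	uint32_t caplen;	/* captured bytes following the header */
	uint32_t datalen;	/* original packet length */
	uint16_t hdrlen;	/* header length including padding */
};

/*
 * Count the records in buf[0..len). Each record occupies hdrlen + caplen
 * bytes and the next one starts at the following word boundary. The walk
 * stops early on a malformed or truncated record.
 */
size_t
demo_count_packets(const unsigned char *buf, size_t len)
{
	struct demo_hdr h;
	uint64_t used;
	size_t off = 0, n = 0, step;

	while (len - off >= sizeof(h)) {
		/* Copy the header out so no unaligned access can occur. */
		memcpy(&h, buf + off, sizeof(h));
		used = (uint64_t)h.hdrlen + h.caplen;
		if (h.hdrlen < sizeof(h) || used > len - off)
			break;		/* malformed or truncated record */
		n++;
		step = DEMO_WORDALIGN((size_t)used);
		if (step >= len - off)
			break;		/* that was the last record */
		off += step;
	}
	return (n);
}
```

The packet payload of each record would start at buf + off + hdrlen. The bounds checks are there because a sketch like this should not trust lengths it reads out of the buffer.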

FILTER MACHINE

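A filter program is an array of instructions, with all branches forwardly directed, terminated by a return instruction. Each instruction performs some action on the pseudo-machine state, which consists of an accumulator, index register, scratch memory store, and implicit program counter.

The following structure defines the instruction format:

    struct bpf_insn {
            u_short     code;
            u_char      jt;
            u_char      jf;
            bpf_u_int32 k;
    };

The k field is used in different ways by different instructions, and the jt and jf fields are used as offsets by the branch instructions. The opcodes are encoded in a semi-hierarchical fashion. There are eight classes of instructions: BPF_LD, BPF_LDX, BPF_ST, BPF_STX, BPF_ALU, BPF_JMP, BPF_RET, and BPF_MISC. Various other mode and operator bits are or'd into the class to give the actual instructions. The classes and modes are defined in <net/bpf.h>.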
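
How a class, a size and a mode combine into one opcode can be shown with a few constants. The DEMO_ names are local stand-ins, and their numeric values and the three masks are assumptions about what <net/bpf.h> defines, so treat them as illustrative only.

```c
/* Local stand-ins; values are assumed to mirror <net/bpf.h>. */
#define DEMO_LD		0x00	/* class: load into accumulator */
#define DEMO_JMP	0x05	/* class: jump */
#define DEMO_H		0x08	/* size: halfword */
#define DEMO_ABS	0x20	/* mode: fixed packet offset */
#define DEMO_JEQ	0x10	/* jump operator: equal */
#define DEMO_K		0x00	/* source: constant k */

/* Field extractors: class in the low 3 bits, size and mode above it. */
#define DEMO_CLASS(code)	((code) & 0x07)
#define DEMO_SIZE(code)		((code) & 0x18)
#define DEMO_MODE(code)		((code) & 0xe0)

/* Compose "load halfword at absolute offset" the way a filter does. */
unsigned short
demo_ldh_abs(void)
{
	return (DEMO_LD + DEMO_H + DEMO_ABS);
}
```

Because each field occupies its own bits, adding the parts and or'ing them give the same opcode, which is why the filter examples can write the opcodes with a plus sign.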

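Below are the semantics for each defined bpf instruction.

We use the convention that A is the accumulator, X is the index register, P[] is the packet data, and M[] is the scratch memory store. P[i:n] gives the data at byte offset "i" in the packet, interpreted as a word (n=4), unsigned halfword (n=2), or unsigned byte (n=1). M[i] gives the i'th word in the scratch memory store, which is addressed only in word units. The memory store is indexed from 0 to BPF_MEMWORDS - 1. k, jt, and jf are the corresponding fields in the instruction definition. "len" refers to the length of the packet.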

BPF_LD

These instructions copy a value into the accumulator. The type of the source operand is specified by an "addressing mode" and can be a constant (BPF_IMM), packet data at a fixed offset (BPF_ABS), packet data at a variable offset (BPF_IND), the packet length (BPF_LEN), or a word in the scratch memory store (BPF_MEM). For BPF_IND and BPF_ABS, the data size must be specified as a word (BPF_W), halfword (BPF_H), or byte (BPF_B). The semantics of all the recognized BPF_LD instructions follow.

    BPF_LD+BPF_W+BPF_ABS    A <- P[k:4]
    BPF_LD+BPF_H+BPF_ABS    A <- P[k:2]
    BPF_LD+BPF_B+BPF_ABS    A <- P[k:1]
    BPF_LD+BPF_W+BPF_IND    A <- P[X+k:4]
    BPF_LD+BPF_H+BPF_IND    A <- P[X+k:2]
    BPF_LD+BPF_B+BPF_IND    A <- P[X+k:1]
    BPF_LD+BPF_W+BPF_LEN    A <- len
    BPF_LD+BPF_IMM          A <- k
    BPF_LD+BPF_MEM          A <- M[k]
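
The P[k:n] notation used in these loads can be modelled in a few lines. This is a sketch under stated assumptions: demo_pkt_load is a hypothetical name, multi-byte values are read most significant byte first (network order), and an access past the end of the packet is reported as an error rather than performed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of P[k:n]: read n bytes (1, 2 or 4) at offset k of
 * a packet of length len, most significant byte first. Returns 0 and
 * stores the value in *out on success, or -1 if the access would run
 * past the end of the packet or n is unsupported.
 */
int
demo_pkt_load(const unsigned char *p, size_t len, size_t k, size_t n,
    uint32_t *out)
{
	uint32_t v = 0;
	size_t i;

	if ((n != 1 && n != 2 && n != 4) || k > len || n > len - k)
		return (-1);
	for (i = 0; i < n; i++)
		v = (v << 8) | p[k + i];	/* network byte order */
	*out = v;
	return (0);
}
```

The indexed forms differ only in the offset: P[X+k:n] would call the same helper with X + k in place of k.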

BPF_LDX

These instructions load a value into the index register. Note that the addressing modes are more restrictive than those of the accumulator loads, but they include BPF_MSH, a hack for efficiently loading the IP header length.

    BPF_LDX+BPF_W+BPF_IMM   X <- k
    BPF_LDX+BPF_W+BPF_MEM   X <- M[k]
    BPF_LDX+BPF_W+BPF_LEN   X <- len
    BPF_LDX+BPF_B+BPF_MSH   X <- 4*(P[k:1]&0xf)
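
The BPF_MSH form is easiest to understand with a concrete byte. The first byte of an IPv4 header carries the version in its high nibble and the header length, counted in 32-bit words, in its low nibble. The sketch below (demo_msh is a hypothetical name) applies the formula from the table to that byte.

```c
#include <stdint.h>

/*
 * Hypothetical model of BPF_LDX+BPF_B+BPF_MSH: keep the low nibble of
 * the byte (the IPv4 header length in 32-bit words) and scale it to
 * bytes.
 */
uint32_t
demo_msh(uint8_t first_ip_byte)
{
	return (4u * (first_ip_byte & 0x0f));
}
```

For the common first byte 0x45 the result is 20, the length of an IPv4 header without options. With that value in X, an indexed load can reach the transport header regardless of IP options.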

BPF_ST

This instruction stores the accumulator into the scratch memory. We do not need an addressing mode since there is only one possibility for the destination.

    BPF_ST                  M[k] <- A

BPF_STX

This instruction stores the index register in the scratch memory store.

    BPF_STX                 M[k] <- X

BPF_ALU

The alu instructions perform operations between the accumulator and index register or constant, and store the result back in the accumulator. For binary operations, a source mode is required (BPF_K or BPF_X).

    BPF_ALU+BPF_ADD+BPF_K   A <- A + k
    BPF_ALU+BPF_SUB+BPF_K   A <- A - k
    BPF_ALU+BPF_MUL+BPF_K   A <- A * k
    BPF_ALU+BPF_DIV+BPF_K   A <- A / k
    BPF_ALU+BPF_AND+BPF_K   A <- A & k
    BPF_ALU+BPF_OR+BPF_K    A <- A | k
    BPF_ALU+BPF_LSH+BPF_K   A <- A << k
    BPF_ALU+BPF_RSH+BPF_K   A <- A >> k
    BPF_ALU+BPF_ADD+BPF_X   A <- A + X
    BPF_ALU+BPF_SUB+BPF_X   A <- A - X
    BPF_ALU+BPF_MUL+BPF_X   A <- A * X
    BPF_ALU+BPF_DIV+BPF_X   A <- A / X
    BPF_ALU+BPF_AND+BPF_X   A <- A & X
    BPF_ALU+BPF_OR+BPF_X    A <- A | X
    BPF_ALU+BPF_LSH+BPF_X   A <- A << X
    BPF_ALU+BPF_RSH+BPF_X   A <- A >> X
    BPF_ALU+BPF_NEG         A <- -A

BPF_JMP

The jump instructions alter flow of control. Conditional jumps compare the accumulator against a constant (BPF_K) or the index register (BPF_X). If the result is true (or non-zero), the true branch is taken, otherwise the false branch is taken. Jump offsets are encoded in 8 bits so the longest jump is 256 instructions. However, the jump always (BPF_JA) opcode uses the 32 bit k field as the offset, allowing arbitrarily distant destinations. All conditionals use unsigned comparison conventions.

    BPF_JMP+BPF_JA          pc += k
    BPF_JMP+BPF_JGT+BPF_K   pc += (A > k) ? jt : jf
    BPF_JMP+BPF_JGE+BPF_K   pc += (A >= k) ? jt : jf
    BPF_JMP+BPF_JEQ+BPF_K   pc += (A == k) ? jt : jf
    BPF_JMP+BPF_JSET+BPF_K  pc += (A & k) ? jt : jf
    BPF_JMP+BPF_JGT+BPF_X   pc += (A > X) ? jt : jf
    BPF_JMP+BPF_JGE+BPF_X   pc += (A >= X) ? jt : jf
    BPF_JMP+BPF_JEQ+BPF_X   pc += (A == X) ? jt : jf
    BPF_JMP+BPF_JSET+BPF_X  pc += (A & X) ? jt : jf

BPF_RET

The return instructions terminate the filter program and specify the amount of packet to accept (i.e., they return the truncation amount). A return value of zero indicates that the packet should be ignored. The return value is either a constant (BPF_K) or the accumulator (BPF_A).

    BPF_RET+BPF_A           accept A bytes
    BPF_RET+BPF_K           accept k bytes
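
The effect of the return value on how much of a packet is kept can be sketched as follows. The helper name is hypothetical, and the rule it encodes, that the captured length is the smaller of the filter's return value and the packet length, is an assumption drawn from the "truncation amount" wording above.

```c
#include <stdint.h>

/*
 * Hypothetical model: number of bytes captured for a packet of pktlen
 * bytes when the filter returned ret. A return of 0 captures nothing,
 * which is how a packet is ignored.
 */
uint32_t
demo_caplen(uint32_t ret, uint32_t pktlen)
{
	return (ret < pktlen ? ret : pktlen);
}
```

Under this reading, returning (u_int)-1, as some of the example filters do, simply means "accept the whole packet", since no packet is longer than that.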

BPF_MISC

The miscellaneous category was created for anything that does not fit into the above classes, and for any new instructions that might need to be added. Currently, these are the register transfer instructions that copy the index register to the accumulator or vice versa.

    BPF_MISC+BPF_TAX        X <- A
    BPF_MISC+BPF_TXA        A <- X

The bpf interface provides the following macros to facilitate array initializers: BPF_STMT(opcode, operand) and BPF_JUMP(opcode, operand, true_offset, false_offset).
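
To make the instruction semantics and the initializer macros concrete, here is a toy interpreter for three instructions. It is a sketch for illustration, not the kernel's implementation: all DEMO_ names are local stand-ins, their numeric values are assumptions about <net/bpf.h>, and the choice to reject a packet on an out-of-range load or unknown opcode is likewise an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; values are assumed to mirror <net/bpf.h>. */
#define DEMO_LD		0x00
#define DEMO_JMP	0x05
#define DEMO_RET	0x06
#define DEMO_H		0x08
#define DEMO_ABS	0x20
#define DEMO_JEQ	0x10
#define DEMO_K		0x00

struct demo_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* Initializer macros in the spirit of BPF_STMT and BPF_JUMP. */
#define DEMO_STMT(code, k)		{ (uint16_t)(code), 0, 0, (k) }
#define DEMO_JUMP(code, k, jt, jf)	{ (uint16_t)(code), (jt), (jf), (k) }

/*
 * Run a program that uses only the three instructions handled below.
 * Returns the number of bytes to accept; 0 means "ignore the packet".
 */
uint32_t
demo_run(const struct demo_insn *prog, size_t ninsn,
    const unsigned char *p, size_t len)
{
	uint32_t a = 0;
	size_t pc = 0;

	while (pc < ninsn) {
		const struct demo_insn *in = &prog[pc++];

		switch (in->code) {
		case DEMO_LD + DEMO_H + DEMO_ABS:	/* A <- P[k:2] */
			if (len < 2 || in->k > len - 2)
				return (0);
			a = ((uint32_t)p[in->k] << 8) | p[in->k + 1];
			break;
		case DEMO_JMP + DEMO_JEQ + DEMO_K:	/* pc += (A == k) ? jt : jf */
			pc += (a == in->k) ? in->jt : in->jf;
			break;
		case DEMO_RET + DEMO_K:			/* accept k bytes */
			return (in->k);
		default:
			return (0);
		}
	}
	return (0);
}

/* Accept 64 bytes of frames whose EtherType (offset 12) is 0x8035. */
const struct demo_insn demo_prog[] = {
	DEMO_STMT(DEMO_LD + DEMO_H + DEMO_ABS, 12),
	DEMO_JUMP(DEMO_JMP + DEMO_JEQ + DEMO_K, 0x8035, 0, 1),
	DEMO_STMT(DEMO_RET + DEMO_K, 64),
	DEMO_STMT(DEMO_RET + DEMO_K, 0),
};
```

Note that pc already points at the following instruction when a jump is evaluated, so an offset of 0 falls through to the next instruction and an offset of 1 skips one. The same convention explains the jt and jf values in the larger filters.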

FILES

/dev/bpf    the packet filter device

EXAMPLES

The following filter is taken from the Reverse ARP Daemon. It accepts only Reverse ARP requests.

    struct bpf_insn insns[] = {
            BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
            BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
            BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
                sizeof(struct ether_header)),
            BPF_STMT(BPF_RET+BPF_K, 0),
    };

This filter accepts only IP packets between host 128.3.112.15 and 128.3.112.35.

    struct bpf_insn insns[] = {
            BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 8),
            BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 26),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 2),
            BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 3, 4),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x80037023, 0, 3),
            BPF_STMT(BPF_LD+BPF_W+BPF_ABS, 30),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x8003700f, 0, 1),
            BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
            BPF_STMT(BPF_RET+BPF_K, 0),
    };
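
The two hexadecimal constants in this filter are the host addresses from the description, and the offsets 26 and 30 are the IPv4 source and destination address fields behind a 14-byte Ethernet header. A small sketch (demo_ipv4 is a hypothetical helper) shows how the dotted quads map to the constants.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack a dotted quad a.b.c.d into the 32-bit value
 * that a word load of the address yields (first octet most significant).
 */
uint32_t
demo_ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return (((uint32_t)a << 24) | ((uint32_t)b << 16) |
	    ((uint32_t)c << 8) | d);
}
```

Here demo_ipv4(128, 3, 112, 15) gives 0x8003700f and demo_ipv4(128, 3, 112, 35) gives 0x80037023, so the filter tests the source against one host and the destination against the other, in both directions.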

Finally, this filter returns only TCP finger packets. We must parse the IP header to reach the TCP header. The BPF_JSET instruction checks that the IP fragment offset is 0 so we are sure that we have a TCP header.

    struct bpf_insn insns[] = {
            BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_IP, 0, 10),
            BPF_STMT(BPF_LD+BPF_B+BPF_ABS, 23),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, IPPROTO_TCP, 0, 8),
            BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
            BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, 0x1fff, 6, 0),
            BPF_STMT(BPF_LDX+BPF_B+BPF_MSH, 14),
            BPF_STMT(BPF_LD+BPF_H+BPF_IND, 14),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 2, 0),
            BPF_STMT(BPF_LD+BPF_H+BPF_IND, 16),
            BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 79, 0, 1),
            BPF_STMT(BPF_RET+BPF_K, (u_int)-1),
            BPF_STMT(BPF_RET+BPF_K, 0),
    };