Documentation/Bpfc

   1 What is bpfc?
   2 /////////////
   3
   4 (derived from ftp://ftp.tik.ee.ethz.ch/pub/students/2011-FS/MA-2011-01.pdf)
   5
   6 Within this work, we have developed a Berkeley Packet Filter (BPF) compiler.
   7 Berkeley Packet Filters as described in literature [31] are small
   8 assembler-like programs that can be interpreted by a virtual machine within
   9 the Linux (or *BSD) kernel [123]. Their main application is to filter out
  10 packets on an early point in the receive path as described in the principle
  11 of early demultiplexing in 2.1.1. However, the operating system kernel can
  12 only work with an array of translated mnemonics that operate on socket buffer
  13 memory. The canonical format of such a single translated instruction is a
  14 tuple of Bit fields of opcode, jt, jf and k and looks like:
  15
  16  +-----------+------+------+---------------------------------+
  17  | opcode:16 | jt:8 | jf:8 | k:32                            |
  18  +-----------+------+------+---------------------------------+
  19
  20 Here, opcode is the numerical representation of an instruction and has a width
  21 of 16 Bit, jt and jf are both jump offset fields (to other instruction tuples)
  22 of 8 Bit width each and k is an optional instruction-specific argument of
  23 32 Bit. The BPF virtual machine construct consists of a 32 Bit accumulator
  24 register, of a 32 Bit index register and of a so-called scratch memory store
  25 that can hold up to 16 32 Bit entries. How each component can be used is
  26 described later in this section.
  27
  28 Since writing filter programs in this numerical format by hand is error-prone
  29 and needs lots of efforts, it is only obvious to have a compiler that
  30 translates McCanne et al.’s [31] syntax into the canonical format. To our
  31 knowledge, the only available compiler is integrated into libpcap [120] and
  32 can be accessed by tools like tcpdump for instance:
  33
  34    #  tcpdump -i eth10 -dd arp
  35    {  0x28, 0, 0, 0x0000000c },
  36    {  0x15, 0, 1, 0x00000806 },
  37    {  0x06, 0, 0, 0x0000ffff },
  38    {  0x06, 0, 0, 0x00000000 },
  39
  40 This example demonstrates how a simple filter looks like, that can be used for
  41 ARP-typed packets. While we believe that tcpdumps filter syntax [124] might be
  42 appropriate for most simple cases, it does not provide the full flexibility and
  43 expressiveness of the BPF syntax from [31] as stated in discussions on the
  44 tcpdump-workers mailing list [125]. To our knowledge, since 1992 there is no
  45 such compiler available that is able to translate from McCanne et al.’s
  46 syntax [31] into the canonical format. Even though the BPF paper is from 1992,
  47 Berkeley Packet Filters are still heavily used nowadays. Up to the latest
  48 versions of Linux, the BPF language is still the only available filter language
  49 that is supported for sockets in the kernel. In 2011, the Eric Dumazet has
  50 developed a BPF Just-In-Time compiler for the Linux kernel [106]. With this,
  51 BPF programs that are attached to PF PACKET sockets (packet(7)) are directly
  52 translated into executable x86/x86 64 or PowerPC assembler code in order to
  53 speed up the execution of filter programs. A first benchmark by Eric Dumazet
  54 showed that 50 ns are saved per filter evaluation on an Intel Xeon DP E5540
  55 with 2.53 GHz clock rate [126]. This seems a small time interval on the first
  56 hand, but it can significantly increase processing capabilities on a high
  57 packet rate.
  58
  59 In order to close the gap of a BPF compiler, we have developed a compiler
  60 named bpfc that is able to understand McCanne et al.’s syntax and semantic.
  61 The instruction set of [31] contains the following possible instructions:
  62
  63 Instr.  Addressing Mode         Description
  64
  65 --------------------Load Instructions------------------------------------------
  66 ldb    [k] or [x + k]           Load (unsigned) byte into accumulator
  67 ldh    [k] or [x + k]           Load (unsigned) half word into accumulator
  68 ld     #k or #len or            Load (unsigned) word into accumulator
  69        M[k] or [k] or [x + k]
  70 ldx    #k or #len or            Load (unsigned) word into index register
  71        M[k] or 4*([k]&0xf)
  72
  73 --------------------Store Instructions-----------------------------------------
  74 st     M[k]                     Copy accumulator into scratch memory store
  75 stx    M[k]                     Copy index register into scratch memory store
  76
  77 --------------------Branch Instructions----------------------------------------
  78 jmp    L                        Jump to label L (in every case)
  79 jeq    #k,Lt,Lf                 Jump to Lt if equal (to accu.), else to Lf
  80 jgt    #k,Lt,Lf                 Jump to Lt if greater than, else to Lf
  81 jge    #k,Lt,Lf                 Jump to Lt if greater than or equal, else to Lf
  82 jset   #k,Lt,Lf                 Jump to Lt if bitwise and, else to Lf
  83
  84 --------------------ALU Instructions-------------------------------------------
  85 add    #k or x                  Addition applied to accumulator
  86 sub    #k or x                  Subtraction applied to accumulator
  87 mul    #k or x                  Multiplication applied to accumulator
  88 div    #k or x                  Division applied to accumulator
  89 mod    #k or x                  Modulo applied to accumulator (Linux-only)
  90 neg                             Negate accumulator (Linux-only)
  91 and    #k or x                  Bitwise and applied to accumulator
  92 or     #k or x                  Bitwise or applied to accumulator
  93 xor    #k or x                  Bitwise xor applied to accumulator (Linux-only)
  94 lsh    #k or x                  Left shift applied to accumulator
  95 rsh    #k or x                  Right shift applied to accumulator
  96
  97 --------------------Return Instructions----------------------------------------
  98 ret    #k or a                  Return
  99
 100 --------------------Miscellaneous Instructions---------------------------------
 101 tax                             Copy accumulator to index register
 102 txa                             Copy index register to accumulator
 103
 104 Next to this, mentioned addressing modes have the following meaning [31]:
 105
 106 Mode        Description
 107
 108 #k          Literal value stored in k
 109 #len        Length of the packet
 110 x           Word stored in the index register
 111 a           Word stored in the accumulator
 112 M[k]        Word at offset k in the scratch memory store
 113 [k]         Byte, halfword, or word at byte offset k in the packet
 114 [x + k]     Byte, halfword, or word at the offset x+k in the packet
 115 L           Offset from the current instruction to L
 116 #k,Lt,Lf    Offset to Lt if the predicate is true, otherwise offset to Lf
 117 4*([k]&0xf) Four times the value of the lower four bits of the byte at
 118             offset k in the packet
 119
 120 Furthermore, the Linux kernel has undocumented BPF filter extensions that can
 121 be found in the virtual machine source code [123]. They are implemented as a
 122 ’hack’ into the instructions ld, ldh and ldb. As negative (or, in unsigned
 123 ’very high’) address offsets cannot be accessed from a network packet, the
 124 following values can then be loaded into the accumulator instead (bpfc syntax
 125 extension):
 126
 127 Mode     Description
 128
 129 #proto   Ethernet protocol
 130 #type    Packet class1 , e.g. Broadcast, Multicast, Outgoing, ...
 131 #ifidx   Network device index the packet was received on
 132 #nla     Load the Netlink attribute of type X (index register)
 133 #nlan    Load the nested Netlink attribute of type X (index register)
 134 #mark    Generic packet mark, i.e. for netfilter
 135 #queue   Queue mapping number for multiqueue devices
 136 #hatype  Network device type2 for ARP protocol hardware identifiers
 137 #rxhash  The packet hash computed on reception
 138 #cpu     CPU number the packet was received on
 139 #vlanp   Loads 1 to the accu if a valid VLAN tag is present
 140 #vlant   Loads the VLAN Tag ID into the accumulator
 141
 142 The #k argument can be in hex (0xff), decimal (256), binary (0b11111111) or
 143 octal (0377) format.
 144
 145 A simple example on how BPF works is demonstrated by retrieving the previous
 146 example of an ARP filter. This time, it is written in BPF language, that bpfc
 147 understands:
 148
 149   ldh [12]           ; this is a comment
 150   jeq #0x806,L1,L2
 151   L1: ret #-1        ; you can also use signed values here
 152   L2: ret #0
 153
 154 Here, the Ethernet type field of the received packet is loaded into the
 155 accumulator first. The next instruction compares the accumulator value with
 156 the absolute value 0x806, which is the Ethernet type for ARP. If both values
 157 are equal, then a jump to the next instruction plus the offset in jt will
 158 occur, otherwise a jump to the next instruction plus the offset in jf is
 159 performed. Since this syntax hides jump offsets, a label for a better
 160 comprehensibility is used instead. Hence, if both values are equal, a jump
 161 to L1, else a jump to L2 is done. Instructions on both labels are return
 162 instructions, thus the virtual machine will be left. The difference of both
 163 lines is the value that is returned. The value states how many bytes of the
 164 network packet should be kept for further processing. Therefore a value of 0
 165 drops the packet entirely and a value larger than 0 keeps and eventually
 166 truncates the last bytes of the packet. Truncating is only done if the length
 167 of the packet is larger than the value that is returned by ret. Else, if the
 168 returned value is larger than the packet size such as 0xfffff Byte, then the
 169 packet is not being truncated by the kernel.
 170
 171 For the BPF language, we have implemented the bpfc compiler in a typical
 172 design that can be found in literature [127] or [128]. There, the sequence of
 173 characters is first translated into a corresponding sequence of symbols of the
 174 vocabulary or also named token of the presented BPF language. Its aim is to
 175 recognize tokens such as opcodes, labels, addressing modes, numbers or other
 176 constructs for later usage. This early phase is called lexical analysis
 177 (figure D.3). Afterwards, the sequence of symbols is transformed into a
 178 representation that directly mirrors the syntactic structure of the source text
 179 and lets this structure easily be recognized [128]. This process is called
 180 syntax parsing and is done with the help of a grammar for the BPF language,
 181 that we have developed. After this phase, the bpfc compiler performs code
 182 generation, which translates the recognized BPF instructions into the numerical
 183 representation that is readable by the kernel.
 184
 185  D.3:
 186   BPF syntax => Tokenizing => Parsing => Code Generation => Numerical Format
 187
 188 We have implemented bpfc in C as an extension for the netsniff-ng toolkit
 189 [119]. The phase of lexical analysis has been realized with flex [129], which
 190 is a fast lexical analyzer. There, all vocabulary of BPF is defined, partly as
 191 regular expressions. flex will then generate code for a lexical scanner that
 192 returns found tokens or aborts on syntax errors. The phase of syntax parsing
 193 is realized with GNU bison [130], which is a Yacc-like parser generator that
 194 cooperates well with flex. Bison then converts a context-free grammar for BPF
 195 into a deterministic LR parser [131] that is implemented in C code. There, a
 196 context-free grammar is a grammar in which every production rule is of nature
 197 M -> w, where M is a single nonterminal symbol and w is (i) a terminal, (ii)
 198 a nonterminal symbol or (iii) a combination of terminals and nonterminals.
 199 In terms of flex and bison, terminals represent defined tokens of the BPF
 200 language and nonterminals are meta-variables used to describe a certain
 201 structure of the language, or how tokens are combined in the syntax. In the
 202 code generation phase, bpfc replaces the parsed instructions by their numerical
 203 representation. This is done in two stages. The first stage is an inline
 204 replacement of opcode and k. Both jump offsets jt and jf are ignored and left
 205 to 0 in the first stage. However, recognized labels are stored in a data
 206 structure for a later retrieval. Then, in the second code generation stage,
 207 BPF jump offsets for jt and jf are calculated.
 208
 209 bpfc also has an optional code validation check. Thus, after code generation,
 210 basic checks are performed on the generated instructions. This includes
 211 checking of jump offsets, so that jumps to non-existent instructions are
 212 prevented, for instance.
 213
 214 In order to use bpfc, the ARP example can be copied into a text file that is
 215 handed over as an command-line argument:
 216
 217   bpfc arp.bpf
 218   { 0x28, 0, 0, 0x0000000c },
 219   { 0x15, 0, 1, 0x00000806 },
 220   { 0x6, 0, 0, 0x000fffff },
 221   { 0x6, 0, 0, 0x00000000 },
 222
 223 Here, bpfc directly outputs the generated code and internally performs code
 224 validation in non-verbose mode. For debugging purposes, bpfc can also be used
 225 verbosely:
 226
 227   bpfc -Vi arp.bpf
 228   *** Generated program:
 229   (000) ldh [12]
 230   (001) jeq #0x806 jt 2 jf 3
 231   (002) ret #1048575
 232   (003) ret #0
 233   *** Validating: is valid!
 234   *** Result:
 235   { 0x28, 0, 0, 0x0000000c },
 236   { 0x15, 0, 1, 0x00000806 },
 237   { 0x6, 0, 0, 0x000fffff },
 238   { 0x6, 0, 0, 0x00000000 },
 239
 240 [31] Copy of original BPF paper: http://netsniff-ng.org/bpf.pdf
 241
 242 Bpfc bindings for Python:
 243 http://carnivore.it/2011/12/27/linux_3.0_bpf_jit_x86_64_exploit