Hitchhiker's guide to high-performance with netsniff-ng:
////////////////////////////////////////////////////////

This is a collection of short notes in random order concerning software
and hardware for optimizing throughput (partly copied or derived from sources
that are mentioned at the end of this file):

<=== Hardware ====>

.-=> Use a PCI-X or PCIe server NIC
`--------------------------------------------------------------------------
Just because it says Gigabit Ethernet on the box of your NIC does not
necessarily mean that you will reach that rate. Especially with small packet
sizes, you won't reach wire-rate with a PCI adapter built for desktop or
consumer machines. Rather, you should buy a server adapter that has a faster
interconnect such as PCIe. Also base your choice of server adapter on how
well it is supported in the kernel: check the Linux drivers directory for
your targeted chipset and look at the netdev list to see whether the driver
is updated frequently. Also, check the location/slot of the NIC adapter on
the system motherboard: in our experience, placing the NIC adapter in
different PCIe slots produced significantly different measurement values.
Since we did not have schematics for the system motherboard, this was a
trial and error effort. Moreover, check the specifications of the NIC
hardware: is the system bus connector I/O capable of Gigabit Ethernet
frame rate throughput? Also check the network topology: is your network
Gigabit switch capable of switching Ethernet frames at the maximum rate
or is a direct connection of two end-nodes the better solution? Is Ethernet
flow control being used? "ethtool -a eth0" can be used to determine this.
For measurement purposes, you might want to turn it off to increase throughput:
  * ethtool -A eth0 autoneg off
  * ethtool -A eth0 rx off
  * ethtool -A eth0 tx off

.-=> Use better (faster) hardware
`--------------------------------------------------------------------------
Before doing software-based fine-tuning, check if you can afford better and
especially faster hardware. For instance, get a fast CPU with lots of cores
or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
dump PCAP files to disk with netsniff-ng, then a fast SSD is appropriate.
If you plan to memory map PCAP files with netsniff-ng, then choose an
appropriate amount of RAM, and so on.

<=== Software (Linux kernel specific) ====>

.-=> Use NAPI drivers
`--------------------------------------------------------------------------
The "New API" (NAPI) is a rework of the packet processing code in the
kernel to improve performance for high speed networking. NAPI provides
two major features:

Interrupt mitigation: High-speed networking can create thousands of
interrupts per second, all of which tell the system something it already
knew: it has lots of packets to process. NAPI allows drivers to run with
(some) interrupts disabled during times of high traffic, with a
corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets,
it's better if those packets are disposed of before much effort goes into
processing them. NAPI-compliant drivers can often cause packets to be
dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don't need to do
anything. Some drivers need you to explicitly specify NAPI in the kernel
config or on the command line when compiling the driver. If you are unsure,
check your driver documentation.

.-=> Use a tickless kernel
`--------------------------------------------------------------------------
The tickless kernel feature allows for on-demand timer interrupts. This
means that during idle periods, fewer timer interrupts will fire, which
should lead to power savings, cooler running systems, and fewer useless
context switches. (Kernel option: CONFIG_NO_HZ=y)

.-=> Reduce timer interrupts
`--------------------------------------------------------------------------
You can select the rate at which timer interrupts in the kernel will fire.
When a timer interrupt fires on a CPU, the process running on that CPU is
interrupted while the timer interrupt is handled. Reducing the rate at
which the timer fires allows for fewer interruptions of your running
processes. This option is particularly useful for servers with multiple
CPUs where processes are not running interactively. (Kernel options:
CONFIG_HZ_100=y and CONFIG_HZ=100)
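
Whether an option such as CONFIG_NO_HZ or CONFIG_HZ is set can be checked
against the config of the running kernel. The following is a minimal Python
sketch and not part of the toolkit: the function names are made up for
illustration, and the default path /boot/config-<release> is an assumption
that holds on many but not all distributions (some kernels expose
/proc/config.gz instead).

```python
import os
import re

# Matches lines such as "CONFIG_HZ=100" or "CONFIG_NO_HZ=y".
_SET_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_kernel_config(text):
    """Return a dict of option name -> value for every option that is set.

    Disabled options appear in the config only as comments
    ("# CONFIG_FOO is not set") and are therefore absent from the result.
    """
    options = {}
    for line in text.splitlines():
        match = _SET_RE.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2).strip('"')
    return options


def running_kernel_config(path=None):
    """Parse the running kernel's config; the default path is an assumption."""
    if path is None:
        path = "/boot/config-" + os.uname().release
    with open(path) as handle:
        return parse_kernel_config(handle.read())
```

For example, running_kernel_config().get("CONFIG_HZ") returns the configured
tick rate as a string, or None if the option is not set.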

.-=> Use Intel's I/OAT DMA Engine
`--------------------------------------------------------------------------
This kernel option enables the Intel I/OAT DMA engine that is present in
recent Xeon CPUs. This option increases network throughput as the DMA
engine allows the kernel to offload network data copying from the CPU to
the DMA engine. This frees up the CPU to do more useful work.

Check to see if it's enabled:

[foo@bar]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There's also a sysfs interface where you can get some statistics about the
DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)

.-=> Use Direct Cache Access (DCA)
`--------------------------------------------------------------------------
Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
DCA allows a driver to warm a CPU cache. A few NIC drivers support DCA; the
most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to
your NIC driver documentation to see if your NIC supports DCA. To enable DCA,
a switch in the BIOS must be flipped. Some vendors supply machines that
support DCA, but don't expose a switch for DCA.

You can check if DCA is enabled:

[foo@bar]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you'll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

This means you'll need to enable it in the BIOS or manually. (Kernel
option: CONFIG_DCA=y)

.-=> Throttle NIC Interrupts
`--------------------------------------------------------------------------
Some drivers allow the user to specify the rate at which the NIC will
generate interrupts. The e1000e driver allows you to pass the module
parameter InterruptThrottleRate when loading the module. For the e1000e
there are two dynamic interrupt throttle mechanisms, specified on the
command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive
algorithm sorts traffic into different classes and adjusts the interrupt
rate appropriately. The difference between dynamic and dynamic conservative
is the rate for the 'Lowest Latency' traffic class: dynamic (1) has a much
more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With modprobe: modprobe e1000e InterruptThrottleRate=1

.-=> Use Process and IRQ affinity
`--------------------------------------------------------------------------
Linux allows the user to specify which CPUs processes and interrupt
handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run on.
Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and
the affinity for each interrupt can be set in the file smp_affinity in the
directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs
to specific CPUs so that when a shared resource is touched (a lock in the
network stack) and loaded to a CPU cache, the next time the handler runs,
it will be put on the same CPU, avoiding costly cache invalidations that
can occur if the handler is put on a different CPU.

Beyond that, improvements of up to 24% have been reported when processes
and the IRQs for the NICs the processes get data from are pinned to the
same CPUs. Doing this ensures that the data loaded into the CPU cache by
the interrupt handler can be used (without invalidation) by the process;
extremely high cache locality is achieved.

NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it automatically
migrates the NIC's IRQ affinity to this CPU to achieve a high cache locality.
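
The two mechanisms above (a taskset-style CPU set for the process, a hex
mask in smp_affinity for the IRQ) can be sketched in Python as follows. This
is an illustration only and not how netsniff-ng does it internally; the
function names are made up, writing smp_affinity needs root, and the IRQ
number has to be looked up in /proc/interrupts first.

```python
import os


def cpus_to_mask(cpus):
    """Return the smp_affinity string with one bit set per CPU number.

    The kernel reads the mask as comma-separated groups of 32 bits
    (8 hex digits), most significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.insert(0, digits[-8:])
        digits = digits[:-8]
    return ",".join(groups)


def pin_process(cpus, pid=0):
    """Restrict a process (0 means the caller) to the given CPUs."""
    os.sched_setaffinity(pid, set(cpus))


def pin_irq(irq, cpus):
    """Write the CPU mask for one IRQ; needs root privileges."""
    with open("/proc/irq/%d/smp_affinity" % irq, "w") as handle:
        handle.write(cpus_to_mask(cpus))
```

Calling pin_process([0]) followed by pin_irq(irq, [0]), with irq being the
number of your NIC's interrupt, places both the process and the interrupt
handler on CPU 0.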

.-=> Tune Socket's memory allocation area
`--------------------------------------------------------------------------
By default, each socket has a backend memory between 130KB and 160KB on
an x86/x86_64 machine with 4GB RAM. Hence, network packets can be received
on the NIC driver layer, but later dropped at the socket queue due to memory
restrictions. "sysctl -a | grep mem" will display your current memory
settings. To increase maximum and default values of read and write memory
areas, use:
   * sysctl -w net.core.rmem_max=8388608
     This sets the max OS receive buffer size for all types of connections.
   * sysctl -w net.core.wmem_max=8388608
     This sets the max OS send buffer size for all types of connections.
   * sysctl -w net.core.rmem_default=65536
     This sets the default OS receive buffer size for all types of connections.
   * sysctl -w net.core.wmem_default=65536
     This sets the default OS send buffer size for all types of connections.
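
How these limits look from the application side can be tried with a few
lines of Python. This is a sketch under assumptions, not toolkit code: the
helper names are made up, and read_sysctl() only handles simple names such
as net.core.rmem_max. On Linux the kernel doubles the value requested with
SO_RCVBUF for bookkeeping and caps the request at net.core.rmem_max, so the
reported size usually differs from the requested one.

```python
import socket


def request_rcvbuf(sock, nbytes):
    """Ask for a receive buffer of nbytes; return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_sysctl(name):
    """Read an integer sysctl such as 'net.core.rmem_max' via /proc/sys."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as handle:
        return int(handle.read().split()[0])
```

If request_rcvbuf(sock, 8388608) reports far less than asked for, the
request was most likely capped by net.core.rmem_max.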

.-=> Enable Linux's BPF Just-in-Time compiler
`--------------------------------------------------------------------------
If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
kernel has a built-in "virtual machine" that interprets BPF opcodes for
filtering packets. Hence, those small filter applications are applied to
each packet. (Read more about this in the Bpfc document.) The Just-in-Time
compiler is able to 'compile' such a filter application to assembler code
that can be run directly on the CPU instead of on the virtual machine. If
netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
automatically enables it. (Kernel options: CONFIG_HAVE_BPF_JIT=y and
CONFIG_BPF_JIT=y)
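
The state of the JIT is exposed through the sysctl net.core.bpf_jit_enable
(0 = off, 1 = on, 2 = on with debug output). A detect-and-enable step could
look like the Python sketch below. It only illustrates the idea and is not
the code netsniff-ng or trafgen use; the function names are made up and
writing the sysctl needs root.

```python
BPF_JIT_SYSCTL = "/proc/sys/net/core/bpf_jit_enable"


def bpf_jit_state(path=BPF_JIT_SYSCTL):
    """Return the sysctl value as an int, or None if the file is absent."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except FileNotFoundError:
        return None


def enable_bpf_jit(path=BPF_JIT_SYSCTL):
    """Switch the JIT on if the kernel offers it; True on success."""
    if bpf_jit_state(path) is None:
        return False
    with open(path, "w") as handle:
        handle.write("1")
    return True
```

A return value of None from bpf_jit_state() suggests that the kernel was
built without CONFIG_BPF_JIT.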

.-=> Increase the TX queue length
`--------------------------------------------------------------------------
There are settings available to regulate the size of the queue between the
kernel network subsystems and the driver for the network interface card.
Just as with any queue, it is recommended to size it such that losses do
not occur due to local buffer overflows. Therefore careful tuning is
required to ensure that the sizes of the queues are optimal for your network
connection.

There are two queues to consider: txqueuelen, which sets the transmit
queue size, and netdev_backlog, which determines the receive queue size.
The transmit queue length can be set manually with the ifconfig command on
the required device:

ifconfig eth0 txqueuelen 2000

The default of 100 is inadequate for long distance or high throughput pipes.
For example, on a network with an RTT of 120ms and at Gigabit rates, a
txqueuelen of at least 10000 is recommended.
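
The sources do not say how the figure of 10000 was derived. One reading that
reproduces it, and this is our assumption, is that the queue should be able
to hold one bandwidth-delay product worth of full-sized 1500 byte frames.
A small Python sketch of that arithmetic (the function name is made up):

```python
def queue_len_for_path(rate_bps, rtt_ms, packet_bytes=1500):
    """Packets needed to hold one bandwidth-delay product, rounded up.

    rate_bps is the link rate in bits per second, rtt_ms the round-trip
    time in milliseconds; both are expected to be integers.
    """
    bdp_bytes = rate_bps * rtt_ms // 8000   # bits/s * ms -> bytes
    return -(-bdp_bytes // packet_bytes)    # ceiling division


print(queue_len_for_path(1_000_000_000, 120))  # prints 10000
```

For 1 Gbit/s and 120ms this gives 15000000 bytes in flight, that is 10000
frames of 1500 bytes. Treat the result as a starting point for experiments
rather than as a fixed rule.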

.-=> Increase kernel receiver backlog queue
`--------------------------------------------------------------------------
For the receiver side, we have a similar queue for incoming packets. This
queue will build up in size when an interface receives packets faster than
the kernel can process them. If this queue is too small (default is 300),
we will begin to lose packets at the receiver, rather than on the network.
One can set this value by:

sysctl -w net.core.netdev_max_backlog=2000
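
Whether the backlog queue actually overflows can be read from
/proc/net/softnet_stat, which has one line per CPU with hexadecimal
counters; the first three columns are packets processed, packets dropped
because the backlog queue was full, and time squeezes. A Python sketch for
reading them (function names are made up for illustration):

```python
def parse_softnet_stat(text):
    """Return one (processed, dropped, time_squeeze) tuple per CPU line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or truncated lines
        rows.append(tuple(int(field, 16) for field in fields[:3]))
    return rows


def total_backlog_drops(path="/proc/net/softnet_stat"):
    """Sum the 'dropped' column over all CPUs."""
    with open(path) as handle:
        rows = parse_softnet_stat(handle.read())
    return sum(dropped for _, dropped, _ in rows)
```

If total_backlog_drops() keeps growing during a capture, a larger
netdev_max_backlog is worth trying.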

.-=> Use a RAM-based filesystem if possible
`--------------------------------------------------------------------------
If you have a considerable amount of RAM, you can also think of using a
RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
This can be useful for small to medium-sized pcap files or for pcap probes
that are generated with netsniff-ng.

<=== Software (netsniff-ng / trafgen specific) ====>

.-=> Bind netsniff-ng / trafgen to a CPU
`--------------------------------------------------------------------------
Both tools have a command-line option '--bind-cpu' that can be used like
'--bind-cpu 0' in order to pin the process to a specific CPU. Pinning was
already mentioned earlier in this file; however, netsniff-ng and trafgen are
able to do this without an external tool. In addition to this CPU pinning,
they also automatically migrate the NIC's IRQ affinity to that CPU. Hence,
with '--bind-cpu 0' netsniff-ng will not be migrated to a different CPU and
the NIC's IRQ affinity will also be moved to CPU 0 to increase cache locality.

.-=> Use netsniff-ng in silent mode
`--------------------------------------------------------------------------
Don't print information to the console when you want to achieve high speed,
because this slows down the application considerably. Hence, use
netsniff-ng's '--silent' option when recording or replaying PCAP files!

.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
`--------------------------------------------------------------------------
The scatter/gather I/O mode, which is the default in netsniff-ng, can be
used to record large PCAP files. It is slower than memory mapped I/O, but
the size of your RAM is not the limit for a recording. Use netsniff-ng's
memory mapped I/O option to record a PCAP at higher speed, with the
trade-off that the maximum allowed size is limited.
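
The difference between the two I/O styles can be illustrated with a short
Python sketch. It is not netsniff-ng's implementation, the function names
are made up, and the byte strings stand in for real PCAP headers and
packets: writev() hands several separate buffers to the kernel in one call,
while the mmap variant has to fix the file size before it can copy data
into the mapping.

```python
import mmap
import os


def write_scatter_gather(path, chunks):
    """Write several separate buffers with a single writev() call.

    Returns the number of bytes written; a robust version would retry
    on short writes.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        return os.writev(fd, chunks)
    finally:
        os.close(fd)


def write_mmap(path, chunks):
    """Size the file up front, map it, and copy the buffers into the map.

    The total length must be known in advance and must not be zero.
    """
    total = sum(len(chunk) for chunk in chunks)
    with open(path, "w+b") as handle:
        handle.truncate(total)
        with mmap.mmap(handle.fileno(), total) as mapping:
            offset = 0
            for chunk in chunks:
                mapping[offset:offset + len(chunk)] = chunk
                offset += len(chunk)
    return total
```

Both functions produce the same file contents; the need to know the size
beforehand in write_mmap() mirrors the size limit mentioned above.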

.-=> Use static packet configurations in trafgen
`--------------------------------------------------------------------------
Don't use counters or byte randomization in the trafgen configuration file,
since they slow down the packet generation process. Static packet bytes are
the fastest to go with.

.-=> For large packets, compile the toolkit with MMX/SSE memcpy
`--------------------------------------------------------------------------
If your traffic has a higher proportion of large packets than small ones,
you could compile the toolkit with an MMX/SSE optimized memcpy, which is
available in src/opt_memcpy.c for x86/x86_64 architectures. Have a look at
the toolkit's install instructions for compiling.

<=== Further things worth experimenting regarding performance ====>

- netsniff-ng/trafgen on Ethernet team devices
  (http://lingrok.org/xref/linux-linus/drivers/net/team/)
- netsniff-ng/trafgen on virtual tunnel devices

Sources:
~~~~~~~~

* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
* http://datatag.web.cern.ch/datatag/howto/tcp.html
* http://thread.gmane.org/gmane.linux.network/191115
* http://bit.ly/3XbBrM
* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
* http://bit.ly/pUFJxU