<html>
<head>
<title>How to do simple loadbalancing with Linux without a single point of failure</title>
</head>
<body bgcolor=#ffffff>
<p>
bert hubert &lt;<a href=mailto:ahu@ds9a.nl>ahu@ds9a.nl</a>&gt;
<p>
Welcome!

This page reflects some experiments I did that show promise in
providing loadbalancing which can be very interesting in some situations.
<p>
This is most useful for services which are CPU bound and not network bound.

<h2>The goal</h2>
Loadbalance a service on one IP address over multiple Linux servers without
generating a new single point of failure.

<h2>Why</h2>

Excellent projects like <a href=http://linux-vs.org>The Linux Virtual
Server</a> or machines like the <a href=http://www.alteonwebsystems.com>Alteon
Acedirector</a> already provide loadbalancing. However, these all entail
either an additional single point of failure, or need the loadbalancing
machine itself to be redundantly implemented (i.e., two boxes).
<p>
Doing so is expensive and often not needed. It is, however, a very good way of
scaling to enormous bandwidths: because of the tricks these solutions
employ, they are able to handle gigabits of traffic.
<p>
We want to be able to provide loadbalancing for hosts that do not saturate
their ethernet, but do need more CPU or IO horsepower than a single box can
provide.

<H2>Intended audience</h2>
Do not interpret this document as a HOWTO. Everything here is very new and
very lightly tested. Play around, let me know what happens, but don't
complain that your 1024-server deployment just does not do what I promised
it would.
<p>
Even if you are confident that you are savvy enough to fool around, only use
what we describe here if your service is CPU or IO bound, and if you are
not saturating your network. If you are saturating it, doing loadbalancing
like this will only hurt performance!

<h2>How it normally works </h2>
<p>
We'll assume that you have four servers, to, and
that the service you want to provide will live on the virtual IP address
We also assume that your subnet is
(, and that your default gateway is,
which need not be a Linux machine. Furthermore, you are using a hub and not
a switch.

<br>
In ascii art:

<pre>
[Client]

[Internet] - --[HUB]---+---------+-----+-----+
default                |         |     |     |
gateway                |         |     |     |
                                 11    12    13
</pre>

Ok - now a customer on the internet wants to access your webserver on
, and a SYN packet (which starts a TCP/IP session) arrives at
your default gateway, which then needs to access a host that feels
responsible for

<p>

In order to find the right host, the router sends out an Address Resolution
Protocol (ARP) 'who-has tell'-query. Normally then
one of your servers responds with its MAC address '00:10:D7:01:20:11 has
'. Your router then uses this information to route the SYN
packet to the proper MAC address, which is then accepted by your webserver

<p>

<b>It is vital that you understand this before proceeding!</b> The MAC
address can be likened to the address of your building, '12 Router Avenue'.
The destination IP address is like the name of your company. The router is
the mailperson that stands in your street and shouts 'Where do I deliver
mail for Evil Linux Routing Tricks INC?'. Your receptionist would then shout
back 'Give it to the people over at 12 Router Avenue', which would prompt
the mailperson to deliver mail at that building.

<p>
Router -> mailperson<br>
Destination IP address -> company name<br>
MAC Address (also Hardware Address, Ethernet Address) -> house number +
street<br>
ARP query -> mailperson shouting 'Where do I deliver..'<br>
ARP response -> receptionist that replies 'Over at 12 Router Avenue'
<h2>How we subvert this for our purposes</h2>
Each IP address can have only one MAC address, because the router remembers
only a single MAC address for it. So we need to give all our webservers the same MAC
address! Yes, this is the icky bit. Also, all webservers need to get an IP
alias so they feel responsible for the service we want to offer on

This is achieved by executing the following on to 13:
<pre>
# ip link set eth0 down
# ip link set eth0 address 1:0:0:0:0:0
# ip link set eth0 up
# ip route add default via
# ip addr add dev eth0
</pre>

FIXME: There are MAC addresses reserved for stunts like these, but I haven't
yet looked them up - please let me know.

The first three commands are self explanatory. The fourth is needed to
reestablish the default route that went down together with the interface.
The last command then adds to the list of addresses the host
feels responsible for.

If you execute this remotely, make sure you do so from a script, as you
might lose contact after 'ip link set eth0 down'! You might even wish to use
'nohup' to make sure your script survives. If you haven't yet tried the
wonderful 'ip' tool, please install iproute2 - it is far superior to
ifconfig and friends for configuring the kernel.
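<p>
The advice about running the commands from a script could be sketched roughly as follows. This is only an illustration: the function names <code>run</code> and <code>set_shared_mac</code>, the <code>DRY_RUN</code> switch and all four arguments are assumptions of this sketch, and the gateway and virtual address must be supplied by you. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: IFACE, SHARED_MAC, GATEWAY and VIP are caller-supplied
# placeholders, not values taken from the text above.
# With DRY_RUN=1 (the default) the commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_shared_mac IFACE SHARED_MAC GATEWAY VIP
set_shared_mac() {
    run ip link set "$1" down
    run ip link set "$1" address "$2"
    run ip link set "$1" up
    # The default route went down together with the interface.
    run ip route add default via "$3"
    # Make the host feel responsible for the virtual address.
    run ip addr add "$4" dev "$1"
}
```

Saved in a file that ends with a call to <code>set_shared_mac</code> (say, a hypothetical <code>sharedmac.sh</code>), it could then be started with something like <code>DRY_RUN=0 nohup sh sharedmac.sh &amp;</code> so that it survives the loss of your login session.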

The new picture:
<pre>
[Client]

[Internet] - --[HUB]---+---------+-----+-----+
default                |         |     |     |
gateway                |         |     |     |
                                 11    12    13
additional:                      2     2     2
all have same MAC address
</pre>

What then happens is that the SYN packet for comes along, the
router does an ARP query to get the MAC address, and gets 4 identical
responses. This in itself is not a problem - it would be neater if only one
machine responded, but hey.

Now comes the problem. The SYN packet gets transmitted over the network, and
again all four machines respond with a SYN|ACK! The router doesn't care
about this, it is an IP device and has no clue what a SYN|ACK packet is. So
it sends all four packets back to the client that initiated the connection.

But the client now does get confused and swiftly drops the connection. Four
almost, but not quite, identical SYN|ACK packets are too much to deal with for a
simple client.

The solution is simple: for each SYN packet, only one host should respond.
Now the problem is how to achieve that.

<h2>Making sure only one host gets the connection</h2>
First concentrate on the SYN packet; we'll deal with the rest later. The
solution is pretty obvious - all machines need to be able to calculate whether
they want to deal with a connection or not. To do this, we look at the IP
address of the client and do some bit fiddling on it.

First let's do this for two hosts. We want all even IP addresses to go to
, all odd ones to We do so with the following
iptables commands:
<pre>
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d -j DROP
[]# iptables -A INPUT -d -j DROP
</pre>
The IP addresses between brackets denote on which hosts the commands need to
be executed. We expressed the 'even/odd' constraint by using the rather
unconventional netmask, '-1' in /-notation.

Basically we say 'drop all traffic to unless the source IP
address is even' (or odd, in case of More explicitly, 'drop
all traffic to if the last bit is/is not 0'.

Well, we're nearly there :-) If you now connect from the outside world to
, depending on the even/oddness of your source IP address, you'll
get connected to either or to!
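<p>
The even/odd decision can be tried out without touching iptables. The sketch below is an illustration only: the function name <code>owner_index</code> is made up here, and the client addresses are documentation placeholders. It masks the last octet of the client address, which is enough as long as the number of hosts is a power of two no larger than 256.

```shell
# owner_index CLIENT_IP HOSTS
# Prints which host (0 .. HOSTS-1) should answer CLIENT_IP.
# HOSTS must be a power of two no larger than 256, since only the
# last octet of the address is examined.
owner_index() {
    last_octet=${1##*.}
    echo $(( last_octet & ($2 - 1) ))
}

owner_index 198.51.100.8 2   # even address, prints 0
owner_index 198.51.100.7 2   # odd address, prints 1
```

Every client address maps to exactly one index, so exactly one host keeps the SYN and the others drop it.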
<H2>Scaling to four or more hosts</h2>
Two hosts are not that interesting because we can, by definition, not deal with the
failure of one box: we started loadbalancing because we needed more
horsepower than one machine can deliver.
<br>
To include all four hosts, we need to look at the last 2 bits of the source
IP address. These last two bits have the values 1 and 2, so the mask is 1+2=3:
<pre>
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
</pre>
This reads like 'drop all traffic to *unless* the last 2 bits of
the IP address are {00,01,10,11}'.
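<p>
Writing these rules out by hand gets tedious as the number of hosts grows, so here is a rough sketch that prints them. It rests on assumptions: the function name <code>drop_rules</code> is invented for this example, 192.0.2.2 is a documentation placeholder for the virtual address, and the <code>0.0.0.i/0.0.0.mask</code> selector is my reading of the 'last bits' description above, with the mask being the number of hosts minus one.

```shell
# drop_rules VIP HOSTS
# Prints one line per host: the host index, then the rule that host
# would load. VIP is whatever virtual address you use; HOSTS must be
# a power of two no larger than 256.
drop_rules() {
    mask=$(( $2 - 1 ))
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "[host $i] iptables -A INPUT -d $1 ! -s 0.0.0.$i/0.0.0.$mask -j DROP"
        i=$(( i + 1 ))
    done
}

drop_rules 192.0.2.2 4
```

The output is meant for review, not for piping straight into a shell.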

If you have 8 hosts this starts to look something like this:
<pre>
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
[]# iptables -A INPUT -d \! -s -j DROP
(...)
[]# iptables -A INPUT -d \! -s -j DROP
</pre>
If your number of servers is not a power of 2, things get a lot more
interesting! See also the 'Where to go from here' chapter.
<h2>Tuning</h2>
There are some problems with the setup so far. Most notably:
<ul>
<li>ICMP traffic that is related to TCP/IP sessions may get delivered to the
wrong server as it may have a different source IP address (any router on
your path can send ICMP messages!)</li>
<li>If you connect to,11,12,13, the other machines with the
same MAC address respond with ICMP redirects 'don't send this to me'.</li>
<li>Unless you switch off IP forwarding on the hosts, they will even forward
the packet right back for you!</li>
</ul>
Luckily, all these problems can be resolved by expanding our iptables rules
a bit, and tweaking some files in /proc.

A suggested (and partly untested) set would be:
<ol>
<li># echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects</li>
<li># echo 0 > /proc/sys/net/ipv4/ip_forward</li>

<li># iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT </li>
<li># iptables -A INPUT -m state --state NEW -p tcp -d -s 0.0.0.X/ -j ACCEPT</li>
<li># iptables -A INPUT -p udp -d -s 0.0.0.X/ -j ACCEPT</li>
<li># iptables -A INPUT -p icmp -d -j DROP</li>
<li># iptables -A INPUT -d -j ACCEPT </li>
<li># iptables -A INPUT -j DROP</li>
</ol>
Where X goes from 0 to 3 for the different hosts.

This prevents the servers from routing stuff back to the network and enables
them to receive TCP and UDP traffic meant for them. All machines receive
ICMP traffic for the virtual IP address, but iptables stateful filtering
makes sure that the kernel stack only sees relevant ICMP messages.

We also make sure that traffic to the non-virtual IP address *is* accepted
properly. The line by line summary:

<OL>
<li>Stop the server from sending out redirects for traffic it doesn't want</li>
<li>Stop the server from forwarding back traffic it doesn't want</li>
<li>Accept already running TCP/IP sessions - this is great for when you
change which new connections (even, odd, whatever) you want to accept,
without hurting existing ones.</li>
<li>Allow new incoming TCP sessions from selected IP addresses</li>
<li>Allow incoming UDP packets from selected IP addresses</li>
<li>Kill any remaining ICMP traffic for the virtual IP - either it already got accepted by the
first iptables line ('RELATED'), or it is not for us</li>
<li>Accept traffic for our real IP address</li>
<li>Drop the rest</li>
</OL>
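<p>
Since the suggested set differs between hosts only in X, it can be printed per host by a small helper. The sketch below is an assumption-laden illustration, not a tested deployment script: the function name <code>tuned_rules</code> is invented, the addresses in the example call are documentation placeholders, and the mask after <code>0.0.0.X/</code> is taken to be the number of hosts minus one.

```shell
# tuned_rules VIP REAL_IP X HOSTS
# Prints the eight suggested commands for the host with index X.
# VIP and REAL_IP are placeholders for your own addresses; HOSTS must
# be a power of two no larger than 256.
tuned_rules() {
    sel="0.0.0.$3/0.0.0.$(( $4 - 1 ))"
    echo "echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects"
    echo "echo 0 > /proc/sys/net/ipv4/ip_forward"
    echo "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
    echo "iptables -A INPUT -m state --state NEW -p tcp -d $1 -s $sel -j ACCEPT"
    echo "iptables -A INPUT -p udp -d $1 -s $sel -j ACCEPT"
    echo "iptables -A INPUT -p icmp -d $1 -j DROP"
    echo "iptables -A INPUT -d $2 -j ACCEPT"
    echo "iptables -A INPUT -j DROP"
}

tuned_rules 192.0.2.2 192.0.2.11 1 4
```

Printing rather than executing keeps the order of the rules visible, which matters because iptables stops at the first rule that matches.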

If you want your machine to ping back, add this after line 5:

<pre>
# iptables -A INPUT -p icmp --icmp-type echo-request -d -j ACCEPT
</pre>

<H2>Where to go from here</h2>
Besides loadbalancing, you may need redundancy. To get it, we need
tools that keep the iptables rules in sync over multiple hosts. This hasn't
been written yet, but it could be.

Such a tool would also calculate and insert the right iptables rules
automatically.
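<p>
One way such a tool might do its calculation, purely as a thought experiment: keep the number of bit patterns ('slots') a power of two and hand the slots out round-robin to whichever hosts are currently alive. The function name <code>assign_slots</code> and the host names are hypothetical, and nothing here has been tested on a real cluster.

```shell
# assign_slots SLOTS HOST...
# Prints "slot host" pairs, spreading SLOTS bit patterns round-robin
# over the hosts given. SLOTS should be a power of two; the host list
# may have any length of at least one.
assign_slots() {
    slots=$1
    shift
    count=$#
    [ "$count" -gt 0 ] || return 1
    i=0
    while [ "$i" -lt "$slots" ]; do
        # Pick the (i mod count)-th surviving host, 1-based.
        pos=$(( i % count + 1 ))
        eval "host=\${$pos}"
        printf '%s %s\n' "$i" "$host"
        i=$(( i + 1 ))
    done
}

assign_slots 4 web10 web11 web12
```

With three hosts and four slots the first host ends up with two slots, so the load is uneven. Using more slots than hosts makes the split finer, at the price of more rules per host.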
<H2>And if I have a switch?</H2>
Two possible solutions - either configure your switch to act as a hub, or
employ additional tricks to confuse the switch so it acts as a hub. The
latter option entails sending from a different MAC address than the one we
listen on. Doing so is, as far as I know, not possible with off the shelf
Linux tools. I doubt if it should be.

Solutions might be to get netfilter in a position where it can change source
MAC addresses on outgoing packets. This should also happen on ARP queries
and replies. As far as I know this is a hot item currently.

Another solution would be to teach Linux that a card can have two addresses, a
'listen address' and a 'send address'.

I will be discussing this with the relevant people. If you feel that you are
one of those people, please contact <a href=mailto:ahu@ds9a.nl>me</a>.

<H2>I think you should be locked up!</H2>
I admit that having multiple hosts with identical MAC addresses is pretty
evil. I also know that there are cleaner solutions. But these all need
additional hardware and create new points of failure. I'm not advocating the
use of this trick for all services, but it would work *very* well for
nameservers. And <a href=http://www.powerdns.com>nameserving</a> is my trade.

<H2>Doesn't Microsoft do something like this with W2K?</H2>
People tell me so - I have never worked with Windows, so I wouldn't know.

<small><center>$Id$</center></small>
</html>