1 Filename: 141-jit-sd-downloads.txt
2 Title: Download server descriptors on demand
5 Author: Peter Palfrader
11 Downloading all server descriptors is the most expensive part
12 of bootstrapping a Tor client. These server descriptors currently
13 amount to about 1.5 Megabytes of data, and this size will grow
14 linearly with network size.
16 Fetching all these server descriptors takes a long while for people
17 behind slow network connections. It is also a considerable load on
18 our network of directory mirrors.
20 This document describes proposed changes to the Tor network and
21 directory protocol so that clients will no longer need to download
22 all server descriptors.
24 These changes consist of moving load balancing information into
25 network status documents, implementing a means to download server
26 descriptors on demand in an anonymity-preserving way, and dealing
27 with exit node selection.
29 2. What is in a server descriptor
31 When a Tor client starts the first thing it will try to get is a
32 current network status document: a consensus signed by a majority
33 of directory authorities. This document is currently about 100
34 Kilobytes in size, though it will grow linearly with network size.
35 This document lists all servers currently running on the network.
36 The Tor client will then try to get a server descriptor for each
37 of the running servers. All server descriptors currently amount
38 to about 1.5 Megabytes of downloads.
40 A Tor client learns several things about a server from its descriptor.
41 Some of these it already learned from the network status document
42 published by the authorities, but the server descriptor contains them
43 again in a single statement signed by the server itself, not just by
44 the directory authorities.
46 Tor clients use the information from server descriptors for
47 different purposes, which are considered in the following sections.
49 #three ways: One, to determine if a server will be able to handle
50 #this client's request; two, to actually communicate or use the server;
51 #three, for load balancing decisions.
53 #These three points are considered in the following subsections.
57 The Tor load balancing mechanism is quite complex in its details, but
58 it has a simple goal: The more traffic a server can handle the more
59 traffic it should get. That means the more traffic a server can
60 handle the more likely a client will use it.
62 For this purpose each server descriptor has bandwidth information
63 which tries to convey a server's capacity to clients.
65 Currently we weight servers differently for different purposes. There
66 is one weight for when we use a server as a guard node (our entry to the
67 Tor network), one weight we assign servers for exit duties,
68 and a third for when we need intermediate (middle) nodes.
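The stated goal (a server that can handle more traffic is chosen more often) can be illustrated with a weighted random pick. This is only a minimal sketch: the function name and the input mapping are hypothetical, and it uses a single weight per server rather than the separate guard, exit and middle weights mentioned above.

```python
import random


def pick_server(bandwidths, rng=random):
    """Pick one server name with probability proportional to its bandwidth.

    `bandwidths` is a hypothetical mapping of nickname -> advertised capacity.
    """
    names = list(bandwidths)
    weights = [bandwidths[name] for name in names]
    # random.choices draws with replacement according to the given weights.
    return rng.choices(names, weights=weights, k=1)[0]
```

With capacities of 900 and 100, the first server would be expected to receive roughly nine out of ten selections.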
72 When a Tor client wants to exit to some resource on the internet it will
73 build a circuit to an exit node that allows access to that resource's
74 IP address and TCP Port.
76 When building that circuit the client can make sure that the circuit
77 ends at a server that will be able to fulfill the request because the
78 client already learned of all the servers' exit policies from their descriptors.
81 2.3 Capability information
83 Server descriptors contain information about the specific version of
84 the Tor protocol they understand [proposal 105].
86 Furthermore the server descriptor also contains the exact version of
87 the Tor software that the server is running and some decisions are
88 made based on the server version number (for instance a Tor client
89 will only make conditional consensus requests [proposal 139] when
90 talking to Tor servers version 0.2.1.1-alpha or later).
92 2.4 Contact/key information
94 A server descriptor lists a server's IP address and TCP ports on which
95 it accepts onion and directory connections. Furthermore it contains
96 the onion key (a short-lived RSA key to which clients encrypt CREATE cells).
99 2.5 Identity information
101 A Tor client learns the digest of a server's key from the network
102 status document. Once it has a server descriptor this descriptor
103 contains the full RSA identity key of the server. Clients verify
104 that 1) the digest of the identity key matches the expected digest
105 it got from the consensus, and 2) that the signature on the descriptor
106 from that key is valid.
109 3. No longer require clients to have copies of all SDs
111 3.1 Load balancing info in consensus documents
113 One of the reasons why clients download all server descriptors is for
114 doing proper load balancing as described in 2.1. In order for
115 clients to not require all server descriptors this information will
116 have to move into the network status document.
118 Consensus documents will have a new line per router similar
119 to the "r", "s", and "v" lines that already exist. This line
120 will convey weight information to clients.
124 The bandwidth number is the lesser of observed bandwidth and bandwidth
125 rate limit from the server descriptor that the "r" line referenced by
126 digest (1st and 3rd field of the bandwidth line in the descriptor).
127 It is given in kilobytes per second so the byte value in the
128 descriptor has to be divided by 1024 (and is then truncated, i.e. rounded down).
131 Authorities will cap the bandwidth number at some arbitrary value,
132 currently 10MB/sec. If a router claims a larger bandwidth an
133 authority's vote will still only show Bandwidth=10240.
135 The consensus value for bandwidth is the median of all bandwidth
136 numbers given in votes. In case of an even number of votes we use
137 the lower median. (Using this procedure allows us to change the
138 cap value more easily.)
140 Clients should believe the bandwidth as presented in the consensus,
141 not capping it again.
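The bandwidth rules above can be sketched as follows. The function names are hypothetical; the cap of 10240 KB/s and the lower-median rule are taken from the text.

```python
BANDWIDTH_CAP_KB = 10240  # 10MB/sec, the current cap named in the text


def vote_bandwidth(rate_bytes, observed_bytes, cap_kb=BANDWIDTH_CAP_KB):
    """Value an authority puts in its vote, in kilobytes per second.

    Takes the lesser of the rate limit and the observed bandwidth
    (1st and 3rd field of the descriptor's bandwidth line), converts
    bytes to kilobytes with truncation, then applies the cap.
    """
    kilobytes = min(rate_bytes, observed_bytes) // 1024
    return min(kilobytes, cap_kb)


def consensus_bandwidth(vote_values):
    """Median of the voted values; the lower median for an even count."""
    if not vote_values:
        raise ValueError("need at least one vote")
    ordered = sorted(vote_values)
    return ordered[(len(ordered) - 1) // 2]
```

Because capping happens in each vote and the consensus only takes a median, changing the cap needs no change to how the consensus is computed.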
143 3.2 Fetching descriptors on demand
145 As described in 2.4 a descriptor lists IP address, OR- and Dir-Port,
146 and the onion key for a server.
148 A client already knows the IP address and the ports from the consensus
149 documents, but without the onion key it will not be able to send
150 CREATE/EXTEND cells for that server. Since the client needs the onion
151 key it needs the descriptor.
153 If a client only downloaded a few descriptors in an observable manner
154 then that would leak which nodes it was going to use.
156 This proposal suggests the following:
158 1) when connecting to a guard node for which the client does not
159 yet have a cached descriptor it requests the descriptor it
160 expects by hash. (The consensus document that the client holds
161 has a hash for the descriptor of this server. We want exactly
162 that descriptor, not a different one.)
164 It does that by sending a RELAY_REQUEST_SD cell.
166 A client MAY cache the descriptor of the guard node so that it does
167 not need to request it every single time it contacts the guard.
169 2) when a client wants to extend a circuit that currently ends in
170 server B to a new next server C, the client will send a
171 RELAY_REQUEST_SD cell to server B. This cell contains in its
172 payload the hash of a server descriptor the client would like
173 to obtain (C's server descriptor). The server sends back the
174 descriptor and the client can now form a valid EXTEND/CREATE cell
175 encrypted to C's onion key.
177 Clients MUST NOT cache such descriptors. If they did they might
178 leak that they already extended to that server at least once before.
181 Replies to RELAY_REQUEST_SD requests need to be padded to some
182 constant upper limit in order to conceal a client's destination
183 from anybody who might be counting cells/bytes.
185 RELAY_REQUEST_SD cells contain the following information:
186 - hash of the server descriptor requested
187 - hash of the identity digest of the server for which we want the SD
188 - IP address and OR-port of the server for which we want the SD
189 - padding factor - the number of cells we want the answer padded to
191 [XXX this just occurred to me and it might be smart. or it might
192 be stupid. clients would learn the padding factor they want
193 to use from the consensus document. This allows us to grow
194 the replies later on should SDs become larger.]
195 [XXX: figure out a decent padding size]
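One way the padding of replies could work is sketched below. The proposal does not define a wire format, so everything here is an assumption made for illustration: the 498-byte usable payload per cell, the 4-byte length prefix, the zero fill byte, and both function names.

```python
import struct

CELL_PAYLOAD_BYTES = 498  # assumed usable bytes per relay cell; not from the proposal


def pad_reply(descriptor, padding_factor):
    """Split a descriptor into exactly `padding_factor` equally sized cells."""
    # Hypothetical framing: a 4-byte big-endian length, then the descriptor.
    framed = struct.pack("!I", len(descriptor)) + descriptor
    total = padding_factor * CELL_PAYLOAD_BYTES
    if len(framed) > total:
        raise ValueError("descriptor does not fit in the requested number of cells")
    padded = framed.ljust(total, b"\x00")
    return [padded[i:i + CELL_PAYLOAD_BYTES] for i in range(0, total, CELL_PAYLOAD_BYTES)]


def unpad_reply(cells):
    """Recover the descriptor from a padded reply."""
    data = b"".join(cells)
    (length,) = struct.unpack("!I", data[:4])
    return data[4:4 + length]
```

An observer counting cells then sees the same reply size for every descriptor, as long as the padding factor is large enough for the largest descriptor in use.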
197 3.3 Protocol versions
199 Server descriptors contain optional information of supported
200 link-level and circuit-level protocols in the form of
201 "opt protocols Link 1 2 Circuit 1". These are not currently needed
202 and will probably eventually move into the "v" (version) line in
203 the consensus. This proposal does not deal with them.
205 Similarly a server descriptor contains the version number of
206 a Tor node. This information is already present in the consensus
207 and is thus available to all clients immediately.
211 Currently finding an appropriate exit node for a user's request is
212 easy for a client because it has complete knowledge of all the exit
213 policies of all servers on the network.
215 The consensus document will once again be extended to contain the
216 information required by clients. This information will be a summary
217 of each node's exit policy. The exit policy summary will only contain
218 the list of ports to which a node exits for most destination IP addresses.
221 A summary should claim a router exits to a specific TCP port if,
222 ignoring private IP addresses, the exit policy indicates that the
223 router would exit to this port for most IP addresses (all but at most
224 two /8 netblocks, or one /8 and a couple of /12s, or any other combination).
225 The exact algorithm used is this: going through all exit policy items,
226 - ignore any accept that is not for all IP addresses ("*"),
227 - ignore rejects for these netblocks (exactly, no subnetting):
228 0.0.0.0/8, 169.254.0.0/16, 127.0.0.0/8, 192.168.0.0/16, 10.0.0.0/8,
230 - for each reject count the number of IP addresses rejected against the affected ports,
232 - once we hit an accept for all IP addresses ("*") add the ports in
233 that policy item to the list of accepted ports, if they don't have
234 more than 2^25 IP addresses (that's two /8 networks) counted
235 against them (i.e. if the router exits to a port to everywhere but
236 at most two /8 networks).
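The steps above can be sketched as follows. The representation of a policy item as a tuple (action, address, first_port, last_port) is an assumption made for this example, and the set of ignored netblocks contains only the entries visible in the list above, so it may be incomplete.

```python
import ipaddress

# Only the netblocks visible in the text above; the list may be incomplete.
IGNORED_REJECTS = {"0.0.0.0/8", "169.254.0.0/16", "127.0.0.0/8",
                   "192.168.0.0/16", "10.0.0.0/8"}
MAX_REJECTED = 2 ** 25  # two /8 networks


def accepted_ports(policy):
    """Return the set of ports a summary should list as accepted.

    `policy` is an ordered list of hypothetical tuples
    (action, address, first_port, last_port); address is "*" or a CIDR string.
    """
    rejected = [0] * 65536  # addresses rejected so far, per port
    accepted = set()
    for action, address, first, last in policy:
        ports = range(first, last + 1)
        if action == "accept":
            if address != "*":
                continue  # ignore accepts that are not for all addresses
            accepted.update(p for p in ports if rejected[p] <= MAX_REJECTED)
        elif address not in IGNORED_REJECTS:
            if address == "*":
                count = 2 ** 32
            else:
                count = ipaddress.ip_network(address, strict=False).num_addresses
            for p in ports:
                rejected[p] += count
    return accepted
```

A reject for all addresses counts 2^32 addresses against its ports, so any later accept for those ports no longer qualifies.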
238 An exit policy summary will be included in votes and consensus as a
239 new line attached to each exit node. The line will have the format
240 "p" <space> "accept"|"reject" <portlist>
241 where portlist is a comma separated list of single port numbers or
242 port ranges (e.g. "22,80-88,1024-6000,6667").
244 Whether the summary shows the list of accepted ports or the list of
245 rejected ports depends on which list is shorter (has a shorter string
246 representation). In case of ties we choose the list of accepted
247 ports. As an exception to this rule an allow-all policy is
248 represented as "accept 1-65535" instead of "reject " and a reject-all
249 policy is similarly given as "reject 1-65535".
251 Summary items are compressed, that is instead of "80-88,89-100" there
252 only is a single item of "80-100", similarly instead of "20,21" a
253 summary will say "20-21".
255 Port lists are sorted in ascending order.
257 The maximum allowed length of a policy summary (including the "accept "
258 or "reject ") is 1000 characters. If a summary exceeds that length we
259 use an accept-style summary and list as much of the port list as is
260 possible within these 1000 characters.
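The formatting rules for the "p" line might be implemented along these lines. The function names are hypothetical, and cutting an over-long accept list at the last complete item is one reading of "as much of the port list as is possible".

```python
ALL_PORTS = set(range(1, 65536))
MAX_SUMMARY_LEN = 1000  # includes the "accept " or "reject " prefix


def compress(ports):
    """Render ports as a sorted, comma separated list with merged ranges."""
    runs = []
    for port in sorted(ports):
        if runs and port == runs[-1][1] + 1:
            runs[-1][1] = port  # extend the current run
        else:
            runs.append([port, port])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


def summary_line(accepted):
    """Build the "p" line for a set of accepted ports."""
    accepted = set(accepted) & ALL_PORTS
    if accepted == ALL_PORTS:
        return "p accept 1-65535"
    if not accepted:
        return "p reject 1-65535"
    accept = "accept " + compress(accepted)
    reject = "reject " + compress(ALL_PORTS - accepted)
    best = accept if len(accept) <= len(reject) else reject  # ties: accept
    if len(best) > MAX_SUMMARY_LEN:
        # Fall back to an accept-style list, cut at the last complete item.
        best = accept[:accept.rfind(",", 0, MAX_SUMMARY_LEN + 1)]
    return "p " + best
```

For example, a router that exits everywhere except port 25 gets "p reject 25", because that is shorter than the equivalent accept list.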
262 3.4.1 Consensus selection
264 When building a consensus, authorities have to agree on a digest of
265 the server descriptor to list in the router line for each router.
266 This is documented in dir-spec section 3.4.
268 All authorities that listed that agreed upon descriptor digest in
269 their vote should also list the same exit policy summary - or list
270 none at all if the authority has not been upgraded to list that
271 information in their vote.
273 If we have votes with matching server descriptor digest of which at
274 least one of them has an exit policy then we distinguish between two cases:
275 a) all authorities agree (or abstained) on the policy summary, and we
276 use the exit policy summary that they all listed in their vote,
277 b) something went wrong (or some authority is playing foul) and we
278 have different policy summaries. In that case we pick the one
279 that is most commonly listed in votes with the matching
280 descriptor. We break ties in favour of the lexicographically larger
283 If none of the votes with a matching server descriptor digest has
284 an exit policy summary we use the most commonly listed one in all
285 votes, breaking ties like in case b above.
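The selection rule can be sketched as below. Votes are modelled as hypothetical (descriptor_digest, summary) pairs, with None standing for an authority that lists no summary. The tie-break compares the summary strings themselves, which is an assumption about what "lexicographically larger" refers to.

```python
from collections import Counter


def most_common_summary(summaries):
    """Most frequent summary; ties go to the lexicographically larger string."""
    counts = Counter(summaries)
    if not counts:
        return None
    return max(counts, key=lambda s: (counts[s], s))


def consensus_summary(votes, agreed_digest):
    """Pick the exit policy summary for one router.

    `votes` is a list of (descriptor_digest, summary_or_None) pairs.
    """
    matching = [s for digest, s in votes
                if digest == agreed_digest and s is not None]
    if matching:
        # Covers case a) (all agree) and case b) (most common wins).
        return most_common_summary(matching)
    # No matching vote carries a summary: fall back to all votes.
    return most_common_summary([s for _, s in votes if s is not None])
```

Abstaining authorities are simply left out of the count, so they never outvote an authority that did list a summary.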
287 3.4.2 Client behaviour
289 When choosing an exit node for a specific request a Tor client will
290 choose from the list of nodes that exit to the requested port as given
291 by the consensus document. If a client has additional knowledge (like
292 cached full descriptors) that indicates the chosen exit node will
293 reject the request then it MAY use that knowledge (or not include such
294 nodes in the selection to begin with). However, clients MUST NOT use
295 nodes that do not list the port as accepted in the summary (but for
296 which they know that the node would exit to that address from other
297 sources, like a cached descriptor).
299 An exception to this is exit enclave behaviour: A client MAY use the
300 node at a specific IP address to exit to any port on the same address
301 even if that node is not listed as exiting to the port in the summary.
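A client-side check against the summary could look like the following sketch. The function names and the mapping of nickname to "p" line are hypothetical, and the exit enclave exception is not modelled.

```python
def parse_summary(line):
    """Parse a "p" line into ("accept" or "reject", list of (first, last))."""
    keyword, action, portlist = line.split(" ", 2)
    if keyword != "p" or action not in ("accept", "reject"):
        raise ValueError("not a policy summary line: %r" % line)
    ranges = []
    for item in portlist.split(","):
        first, _, last = item.partition("-")
        ranges.append((int(first), int(last or first)))
    return action, ranges


def summary_allows(line, port):
    """True if the summary lists `port` as accepted."""
    action, ranges = parse_summary(line)
    listed = any(first <= port <= last for first, last in ranges)
    return listed if action == "accept" else not listed


def candidate_exits(summaries, port):
    """Nicknames whose summary accepts `port`, from a nickname -> line mapping."""
    return sorted(name for name, line in summaries.items()
                  if summary_allows(line, port))
```

A client with cached full descriptors could filter this candidate list further, but under the rule above it would not add nodes that the summary excludes.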
305 4.1 Consensus document changes.
307 The consensus will need to include
308 - bandwidth information (see 3.1)
309 - exit policy summaries (3.4)
311 A new consensus method (number TBD) will be chosen for this.
313 5. Future possibilities
315 This proposal still requires that all servers have the descriptors of
316 every other node in the network in order to answer RELAY_REQUEST_SD
317 cells. These cells are sent when a circuit is extended from ending at
318 node B to a new node C. In that case B would have to answer a
319 RELAY_REQUEST_SD cell that asks for C's server descriptor (by SD digest).
321 In order to answer that request B obviously needs a copy of C's server
322 descriptor. The RELAY_REQUEST_SD cell already has all the info that
323 B needs to contact C so it can ask about the descriptor before passing it