1 Filename: 126-geoip-reporting.txt
2 Title: Getting GeoIP data and publishing usage summaries
3 Author: Roger Dingledine
6 Implemented-In: 0.2.0.x
10 In 0.2.0.x, this proposal is implemented to the extent needed to
11 address its motivations. See the notes below marked "RESOLUTION".
14 1. Background and motivation
16 Right now we can keep a rough count of Tor users, both total and by
17 country, by watching connections to a single directory mirror. Being
18 able to get usage estimates is useful both for our funders (to
19 demonstrate progress) and for our own development (so we know how
20 quickly we're scaling and can design accordingly, and so we know which
21 countries and communities to focus on more). This need for information
22 is the only reason we haven't deployed "directory guards" (think of
23 them like entry guards but for directory information; in practice,
24 it would seem that Tor clients should simply use their entry guards
25 as their directory guards; see also proposal 125).
27 With the move toward bridges, we will no longer be able to track Tor
28 clients that use bridges, since they use their bridges as directory
29 guards. Further, we need to be able to learn which bridges stop seeing
30 use from certain countries (and are thus likely blocked), so we can
31 avoid giving them out to other users in those countries.
33 Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
34 and circuits on its 'network map', and it performs anonymized GeoIP
35 lookups to its central servers to know where to put the dots. Vidalia
36 caches answers it gets -- to reduce delay, to reduce overhead on
37 the network, and to reduce anonymity issues where users reveal their
38 knowledge about the network through which IP addresses they ask about.
40 But with the advent of bridges, Tor clients are asking about IP
41 addresses that aren't in the main directory. In particular, bridge
42 users inform the central Vidalia servers about each bridge as they
43 discover it and their Vidalia tries to map it.
45 Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
46 own IP address, so it can provide a more useful map.
48 Finally, Vidalia's central servers leave users open to partitioning
49 attacks, even if they can't target specific users. Further, as we
50 start using GeoIP results for more operational or security-relevant
51 goals, such as avoiding or including particular countries in circuits,
52 it becomes more important that users can't be singled out in terms of
53 their IP-to-country mapping beliefs.
55 2. The available GeoIP databases
57 There are at least two classes of GeoIP database out there: "IP to
58 country", which tells us the country code for the IP address but
59 no more details, and "IP to city", which tells us the country code,
60 the name of the city, and some basic latitude/longitude guesses.
62 A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
63 bytes. A typical line is:
64 "205500992","208605279","US","USA","UNITED STATES"
65 http://ip-to-country.webhosting.info/node/view/5
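To make the format concrete, here is a minimal sketch of a local lookup against rows shaped like the sample line above. It assumes the first two fields are the inclusive start and end of an IPv4 range written as integers, and that ranges do not overlap; the function names are mine, not part of any existing tool.

```python
import bisect
import csv
import io
import ipaddress

def load_ranges(csv_text):
    """Parse ip-to-country rows into a sorted list of (start, end, code)."""
    ranges = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        ranges.append((int(row[0]), int(row[1]), row[2]))
    ranges.sort()
    return ranges

def lookup_country(ranges, address):
    """Return the two-letter code for a dotted-quad address, or None."""
    value = int(ipaddress.IPv4Address(address))
    # Find the last range whose start is <= value (assumes no overlaps).
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None

# The sample row from the text; it covers 12.63.178.64 through 12.111.16.95.
SAMPLE = '"205500992","208605279","US","USA","UNITED STATES"\n'
ranges = load_ranges(SAMPLE)
print(lookup_country(ranges, "12.64.0.1"))
print(lookup_country(ranges, "8.8.8.8"))
```

A real implementation would build the list of range starts once rather than on every lookup; it is rebuilt here only to keep the sketch short.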
67 Similarly, the MaxMind GeoLite Country database is also about 500KB
69 http://www.maxmind.com/app/geolitecountry
71 The MaxMind GeoLite City database gives more fine-grained detail like
72 geo coordinates and city name. Vidalia currently makes use of this
73 information. On the other hand it's 16MB compressed. A typical line is:
74 206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
75 http://www.maxmind.com/app/geolitecity
77 There are other databases out there, like
78 http://www.hostip.info/faq.html
79 http://www.webconfs.com/ip-to-city.php
80 that want more attention, but for now let's assume that all the db's
83 3. What we'd like to solve
85 Goal #1a: Tor relays collect IP-to-country user stats and publish
87 Goal #1b: Tor bridges collect IP-to-country user stats and publish
90 Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
92 Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
93 can pick countries for her paths.
95 Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
97 Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
100 Goal #5: Reduce partitioning opportunities where Vidalia central
101 servers can give different (distinguishing) responses.
105 Our goal is to allow Tor relays, bridges, and clients to learn enough
106 GeoIP information so they can do local private queries.
108 4.1. The IP-to-country db
110 Directory authorities should publish a "geoip" file that contains
111 IP-to-country mappings. Directory caches will mirror it, and Tor clients
112 and relays (including bridge relays) will fetch it. Thus we can solve
113 goals 1a and 1b (publish sanitized usage info). Controllers could also
114 use this to solve goal 2b (choosing path by country attributes). It
115 also solves goal 4 (learning the Tor client's country), though for
116 huge countries like the US we'd still need to decide where the "middle"
117 should be when we're mapping that address.
119 The IP-to-country details are described further in Sections 5 and
122 [RESOLUTION: The geoip file in 0.2.0.x is not distributed through
123 Tor. Instead, it is shipped with the bundle.]
125 4.2. The IP-to-city db
127 In an ideal world, the IP-to-city db would be small enough that we
128 could distribute it in the above manner too. But for now, it is too
129 large. Here's where the design choice forks.
131 Option A: Vidalia should continue doing its anonymized IP-to-city
132 queries. Thus we can achieve goals 2a and 2b. We would solve goal
133 3 by only doing lookups on descriptors that are purpose "general"
134 (see Section 4.2.1 for how). We would leave goal 5 unsolved.
136 Option B: Each directory authority should keep an IP-to-city db,
137 look up the value for each router it lists, and include that line in
138 the router's network-status entry. The network-status consensus would
139 then use the line that appears in the majority of votes. This approach
140 also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
141 at all now), and goal 5 (reduced partitioning risks).
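The "line that appears in the majority of votes" rule in Option B could be sketched as below. This is only an illustration: the proposal does not fix a line format or a tie-breaking rule, so the strict-majority threshold, the function name, and the dissenting vote in the example are all assumptions.

```python
from collections import Counter

def consensus_line(votes):
    """Return the line a strict majority of votes agree on, else None."""
    if not votes:
        return None
    line, count = Counter(votes).most_common(1)[0]
    return line if count > len(votes) / 2 else None

# Three authorities vote on one router's IP-to-city line. Two agree on
# the sample line from Section 2; the third vote is invented for the demo.
votes = [
    "206.124.149.146,Bellevue,WA,US,47.6051,-122.1134",
    "206.124.149.146,Bellevue,WA,US,47.6051,-122.1134",
    "206.124.149.146,Seattle,WA,US,47.6062,-122.3321",
]
print(consensus_line(votes))
```

If no line reaches a majority, the sketch returns None, which would correspond to leaving the line out of the consensus entry.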
143 Option B has the advantage that Vidalia can simplify its operation,
144 and the advantage that this consensus IP-to-city data is available to
145 other controllers besides just Vidalia. But it has the disadvantage
146 that the networkstatus consensus becomes larger, even though most of
147 the GeoIP information won't change from one consensus to the next. Is
148 there another reasonable location for it that can provide similar
149 consensus security properties?
151 [RESOLUTION: IP-to-city is not supported.]
153 4.2.1. Controllers can query for router annotations
155 Vidalia needs to stop doing queries on bridge relay IP addresses.
156 It could do that by only doing lookups on descriptors that are in
157 the networkstatus consensus, but that precludes designs like Blossom
158 that might want to map its relay locations. The best answer is that it
159 should learn the router annotations, with a new controller 'getinfo'
161 "GETINFO desc-annotations/id/<OR identity>"
162 which would respond with something like
163 @downloaded-at 2007-11-29 08:06:38
164 @source "128.31.0.34"
167 [We could also make the answer include the digest for the router in
168 question, which would enable us to ask GETINFO router-annotations/all.
169 Is this worth it? -RD]
171 Then Vidalia can avoid doing lookups on descriptors with purpose
172 "bridge". Even better would be to add a new annotation "@private true"
173 so Vidalia can know how to handle new purposes that we haven't created
174 yet. Vidalia could special-case "bridge" for now, for compatibility
175 with the current 0.2.0.x-alphas.
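A controller-side check along these lines might look like the sketch below. It assumes annotations arrive one per line as "@keyword value", and that the reply carries an "@purpose" annotation; "@private" is the annotation proposed above, not something Tor provides today. The function names are hypothetical.

```python
def parse_annotations(reply_text):
    """Turn '@keyword value' lines into a dict, dropping surrounding quotes."""
    annotations = {}
    for line in reply_text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue
        key, _, value = line[1:].partition(" ")
        annotations[key] = value.strip().strip('"')
    return annotations

def may_lookup_externally(annotations):
    """Decide whether an external GeoIP lookup on this router is acceptable."""
    # Prefer the proposed "@private" annotation when it is present.
    if "private" in annotations:
        return annotations["private"] != "true"
    # Otherwise be conservative: only purpose "general" is looked up, which
    # excludes "bridge" and any purpose we do not know about.
    return annotations.get("purpose") == "general"

# The "@purpose" line here is an assumed annotation, added for the example.
reply = '''@downloaded-at 2007-11-29 08:06:38
@source "128.31.0.34"
@purpose bridge
'''
print(may_lookup_externally(parse_annotations(reply)))
```

Refusing lookups for unknown purposes errs on the side of not leaking addresses, at the cost of leaving some relays off the map.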
179 My overall recommendation is that we should implement 4.1 soon
180 (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
181 with the hope that later we discover a better way to distribute the
182 IP-to-city info and can switch to 4.2 option B.
184 Below we discuss more how to go about achieving 4.1.
186 5. Publishing and caching the GeoIP (IP-to-country) database
188 Each v3 directory authority should put a copy of the "geoip" file in
189 its datadirectory. Then its network-status votes should include a hash
190 of this file (Recommended-geoip-hash: %s), and the resulting consensus
191 directory should specify the consensus hash.
193 There should be a new URL for fetching this geoip db (by "current.z"
194 for testing purposes, and by hash.z for typical downloads). Authorities
195 should fetch and serve the one listed in the consensus, even when they
196 vote for their own. This would argue for storing the cached version
197 in a better filename than "geoip".
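As a sketch of the check a mirror or client might run on a file fetched by hash: compute the digest of the downloaded bytes and compare it to the hash named in the consensus. The proposal does not say which digest or encoding to use, so SHA-256 with hex encoding is an assumption here, as are the function names.

```python
import hashlib

def geoip_digest(data: bytes) -> str:
    """Hex digest of a geoip file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def accept_download(data: bytes, consensus_hash: str) -> bool:
    """Accept a downloaded geoip file only if it matches the consensus hash."""
    return geoip_digest(data) == consensus_hash.strip().lower()

# An authority hashes its copy of the file; a client later checks a download.
published = b'"205500992","208605279","US","USA","UNITED STATES"\n'
recommended = geoip_digest(published)
print(accept_download(published, recommended))
print(accept_download(published + b"tampered", recommended))
```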
199 Directory mirrors should keep a copy of this file available via the
202 We assume that the file would change at most a few times a month. Should
203 Tor ship with a bootstrap geoip file? An out-of-date geoip file may
204 open you up to partitioning attacks, but for the most part it won't
207 There should be a config option to disable updating the geoip file,
208 in case users want to use their own file (e.g. they have a proprietary
209 GeoIP file they prefer to use). In that case we leave it up to the
210 user to update his geoip file out-of-band.
212 [XXX Should consider forward/backward compatibility, e.g. if we want
213 to move to a new geoip file format. -RD]
215 [RESOLUTION: Not done over Tor.]
217 6. Controllers use the IP-to-country db for mapping and for path building
219 Down the road, Vidalia could use the IP-to-country mappings for placing
221 - The location of the client
222 - The location of the bridges, or other relays not in the
223 networkstatus, on the map.
224 - Any relays that it doesn't yet have an IP-to-city answer for.
226 Other controllers can also use it to set EntryNodes, ExitNodes, etc.
227 in a per-country way.
229 To support these features, we need to export the IP-to-country data
230 via the Tor controller protocol.
232 Is it sufficient just to add a new GETINFO command?
233 GETINFO ip-to-country/128.31.0.34
234 250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
236 [RESOLUTION: Not done now, except for the getinfo command.]
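For a controller author, parsing a reply shaped like the example above might look like this sketch. It assumes the reply is exactly the single line shown (status code, key, then three quoted comma-separated fields); the reply format Tor actually uses may differ, and the function name is hypothetical.

```python
import csv

def parse_ip_to_country(reply_line):
    """Split the example reply into (address, code2, code3, name)."""
    status, _, rest = reply_line.strip().partition("ip-to-country/")
    if not status.startswith("250") or not rest:
        raise ValueError("not a successful ip-to-country reply")
    address, sep, fields = rest.partition("=")
    if not sep:
        raise ValueError("missing '=' in reply")
    # The csv module handles the quoted, comma-separated fields.
    code2, code3, name = next(csv.reader([fields]))
    return address, code2, code3, name

line = '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"'
print(parse_ip_to_country(line))
```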
238 6.1. Other interfaces
240 Robert Hogan has also suggested a
242 GETINFO relays-by-country/cn
244 as well as torrc options for ExitCountryCodes, EntryCountryCodes,
245 ExcludeCountryCodes, etc.
247 [RESOLUTION: Not implemented in 0.2.0.x. Fodder for a future proposal.]
249 7. Relays and bridges use the IP-to-country db for usage summaries
251 Once bridges have a GeoIP database locally, they can start to publish
252 sanitized summaries of client usage -- how many users they see and from
253 what countries. This might also be a more useful way for ordinary Tor
254 relays to convey the level of usage they see, which would allow us to
255 switch to using directory guards for all users by default.
257 But how to safely summarize this information without opening too many
260 7.1. Attacks to think about
262 First, note that we need to have a large enough time window that we're
263 not aiding correlation attacks much. I hope 24 hours is enough. So
264 that means no publishing stats until you've been up at least 24 hours.
265 And you can't publish follow-up stats more often than every 24 hours,
266 or people could look at the differential.
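The two timing rules above (no stats until 24 hours of uptime, and no follow-up sooner than 24 hours after the last one) can be written as a small predicate. This is a sketch with times in seconds; the function and parameter names are mine.

```python
DAY = 24 * 60 * 60  # the 24-hour window from the text, in seconds

def may_publish(now, started_at, last_published_at=None):
    """Return True if publishing stats now respects both 24-hour rules."""
    if now - started_at < DAY:
        return False  # not up long enough yet
    if last_published_at is not None and now - last_published_at < DAY:
        return False  # a follow-up this soon would expose the differential
    return True

print(may_publish(now=23 * 3600, started_at=0))
print(may_publish(now=30 * 3600, started_at=0))
print(may_publish(now=40 * 3600, started_at=0, last_published_at=30 * 3600))
```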
268 Second, note that we need to be sufficiently vague about the IP
269 addresses we're reporting. We are hoping that just specifying the
270 country will be vague enough. But a) what about active attacks where
271 we convince a bridge to use a GeoIP db that labels each suspect IP
272 address as a unique country? We have to assume that the consensus GeoIP
273 db won't be malicious in this way. And b) could such singling-out
274 attacks occur naturally, for example because of countries that have
275 a very small IP space? We should investigate that.
277 7.2. Granularity of users
279 Do we only want to report countries that have a sufficient anonymity set
280 (that is, number of users) for the day? For example, we might avoid
281 listing any countries that have seen fewer than five addresses over
282 the 24-hour period. This approach would be helpful in reducing the
283 singling-out opportunities -- in the extreme case, we could imagine a
284 situation where one blogger from the Sudan used Tor on a given day, and
285 we can discover which entry guard she used.
287 But I fear that especially for bridges, seeing only one hit from a
288 given country in a given day may be quite common.
290 As a compromise, we should start out with an "Other" category in
291 the reported stats, which is the sum of unlisted countries; if that
292 category is consistently interesting, we can think harder about how
293 to get the right data from it safely.
295 But note that bridge summaries will not be made public individually,
296 since doing so would help people enumerate bridges. Whereas summaries
297 from normal relays will be public. So perhaps that means we can afford
298 to be more specific in bridge summaries? In particular, I'm thinking the
299 "Other" category should be used by public relays but not for bridges
300 (or if it is, used with a lower threshold).
302 Even for countries that have many Tor users, we might not want to be
303 too specific about how many users we've seen. For example, we might
304 round down the number of users we report to the nearest multiple of 5.
305 My instinct for now is that this won't be that useful.
309 Another note: we'll likely be overreporting in the case of users with
310 dynamic IP addresses: if they rotate to a new address over the course
311 of the day, we'll count them twice. So be it.
313 7.4. Where to publish the summaries?
315 We designed extrainfo documents for information like this. So they
316 should just be more entries in the extrainfo doc.
318 But if we want to publish summaries every 24 hours (no more often,
319 no less often), aren't we tied to the router descriptor publishing
320 schedule? That is, if we publish a new router descriptor at the 18
321 hour mark, and nothing much has changed at the 24 hour mark, won't
322 the new descriptor get dropped as being "cosmetically similar", and
323 then nobody will know to ask about the new extrainfo document?
325 One solution would be to make and remember the 24 hour summary at the
326 24 hour mark, but not actually publish it anywhere until we happen to
327 publish a new descriptor for other reasons. If we happen to go down
328 before publishing a new descriptor, then so be it, at least we tried.
330 7.5. What if the relay is unreachable or goes to sleep?
332 Even if you've been up for 24 hours, if you were hibernating for 18
333 of them, then we're not getting as much fuzziness as we'd like. So
334 I guess that means that we need a 24-hour period of being "awake"
335 before we're willing to publish a summary. A similar attack works if
336 you've been awake but unreachable for the first 18 of the 24 hours. As
337 another example, a bridge that's on a laptop might be suspended for
340 This implies that some relays and bridges will never publish summary
341 stats, because they're not ever reliably working for 24 hours in
342 a row. If a significant percentage of our reporters end up being in
343 this boat, we should investigate whether we can accumulate 24 hours of
344 "usefulness", even if there are holes in the middle, and publish based
347 What other issues are like this? It seems that just moving to a new
348 IP address shouldn't be a reason to cancel stats publishing, assuming
349 we were usable at each address.
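One way to accumulate 24 hours of "usefulness" across holes is to keep a running total of usable time and only consult the total when deciding to publish. The sketch below assumes the relay can tell when it becomes usable (awake and reachable) and when it stops; the class and method names are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds

class UsefulnessClock:
    """Accumulate usable time across hibernation or unreachability gaps."""

    def __init__(self):
        self.accumulated = 0.0
        self._usable_since = None  # start of the current usable stretch

    def mark_usable(self, now):
        if self._usable_since is None:
            self._usable_since = now

    def mark_unusable(self, now):
        if self._usable_since is not None:
            self.accumulated += now - self._usable_since
            self._usable_since = None

    def usable_seconds(self, now):
        current = now - self._usable_since if self._usable_since is not None else 0
        return self.accumulated + current

    def ready(self, now):
        """True once 24 hours of usable time have accumulated."""
        return self.usable_seconds(now) >= DAY

# Up for 6 hours, hibernating for 18, then up again: the relay is ready
# at hour 42, not at hour 24.
clock = UsefulnessClock()
clock.mark_usable(0)
clock.mark_unusable(6 * 3600)
clock.mark_usable(24 * 3600)
print(clock.ready(24 * 3600), clock.ready(42 * 3600))
```

Note that a change of IP address never has to touch this clock, which matches the intuition that moving addresses should not cancel stats publishing.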
351 7.6. IP addresses that aren't in the geoip db
353 Some IP addresses aren't in the public geoip databases. In particular,
354 I've found that a lot of African countries are missing, but there
355 are also some common ones in the US that are missing, like parts of
356 Comcast. We could just lump unknown IP addresses into the "Other"
357 category, but it might be useful to gather a general sense of how many
358 lookups are failing entirely, by adding a separate "Unknown" category.
360 We could also contribute back to the geoip db, by letting bridges set
361 a config option to report the actual IP addresses that failed their
362 lookup. Then the bridge authority operators can manually make sure
363 the correct answer will be in later geoip files. This config option
364 should be disabled by default.
366 7.7. Bringing it all together
370 24 hours after starting up (modulo Section 7.5 above), bridges and
371 relays should construct a daily summary of client countries they've
372 seen, including the above "Unknown" category (Section 7.6) as well.
374 Non-bridge relays lump all countries with fewer than K (e.g. K=5) users
375 into the "Other" category (see Section 7.2 above), whereas bridge relays are
376 willing to list a country even when it has only one user for the day.
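The summary construction described so far could be sketched as follows. The input is a mapping from each distinct client address seen in the window to its country code, with None for failed lookups. Leaving "Unknown" out of the K threshold is my assumption, since the text does not say either way; the function name is hypothetical and the addresses in the example are documentation-range placeholders.

```python
from collections import Counter

def daily_summary(country_by_address, is_bridge, k=5):
    """Build per-country user counts with "Other" and "Unknown" buckets."""
    counts = Counter()
    for code in country_by_address.values():
        counts[code if code is not None else "Unknown"] += 1
    # Bridges list a country even with a single user; relays need K users.
    threshold = 1 if is_bridge else k
    summary = Counter()
    for label, n in counts.items():
        if label == "Unknown" or n >= threshold:
            summary[label] += n
        else:
            summary["Other"] += n
    return dict(summary)

seen = {f"192.0.2.{i}": "US" for i in range(6)}
seen.update({"198.51.100.1": "SD", "198.51.100.2": "SD", "203.0.113.9": None})
print(daily_summary(seen, is_bridge=False))
print(daily_summary(seen, is_bridge=True))
```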
378 Whenever we have a daily summary on record, we include it in our
379 extrainfo document whenever we publish one. The daily summary we
380 remember locally gets replaced with a newer one when another 24
383 7.8. Some forward secrecy
385 How should we remember addresses locally? If we convert them into
386 country-codes immediately, we will count them again if we see them
387 again. On the other hand, we don't really want to keep a list hanging
388 around of all IP addresses we've seen in the past 24 hours.
390 Step one is that we should never write this stuff to disk. Keeping it
391 only in RAM will make things somewhat better. Step two is to avoid
392 keeping any timestamps associated with it: rather than a rolling
393 24-hour window, which would require us to remember the various times
394 we've seen that address, we can instead just throw out the whole list
395 every 24 hours and start over.
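A sketch of the "throw out the whole list" approach: one in-memory set plus a single window-start time, with no per-address timestamps. The class name is hypothetical, and a real relay would build its daily summary from the set just before discarding it.

```python
DAY = 24 * 60 * 60  # seconds

class SeenAddresses:
    """In-RAM set of client addresses, discarded wholesale every 24 hours."""

    def __init__(self, now):
        self._seen = set()
        self._window_start = now  # the only timestamp we keep

    def _expire(self, now):
        if now - self._window_start >= DAY:
            self._seen.clear()
            self._window_start = now

    def note(self, address, now):
        """Record an address; return True if it is new in this window."""
        self._expire(now)
        is_new = address not in self._seen
        self._seen.add(address)
        return is_new

    def count(self, now):
        self._expire(now)
        return len(self._seen)

seen_today = SeenAddresses(now=0)
seen_today.note("192.0.2.1", now=10)
seen_today.note("192.0.2.1", now=20)   # same client again, not recounted
print(seen_today.count(now=30))
```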
397 We could hash the addresses, and then compare hashes when deciding if
398 we've seen a given address before. We could even do keyed hashes. Or
399 Bloom filters. But if our goal is to defend against an adversary
400 who steals a copy of our RAM while we're running and then does
401 guess-and-check on whatever blob we're keeping, we're in bad shape.
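To see why keyed hashes do not help against that adversary, consider this sketch. The key has to live in the same RAM as the hashed addresses, so whoever steals one gets the other and can simply enumerate candidates. HMAC-SHA256 is my choice for the keyed hash, and the names and addresses are illustrative.

```python
import hashlib
import hmac
import os

KEY = os.urandom(32)  # generated at startup and kept only in RAM

def tag(address, key=KEY):
    """Keyed hash of an address (HMAC-SHA256 is an assumption)."""
    return hmac.new(key, address.encode("ascii"), hashlib.sha256).digest()

# What the relay keeps instead of raw addresses.
stored_tags = {tag("192.0.2.7")}

def guess_and_check(stolen_key, stolen_tags, candidates):
    """What an adversary holding a RAM snapshot can do."""
    return [a for a in candidates if tag(a, stolen_key) in stolen_tags]

candidates = [f"192.0.2.{i}" for i in range(256)]
print(guess_and_check(KEY, stored_tags, candidates))
```

The IPv4 address space is small enough that the same loop over every possible address is feasible, so the hash only slows the adversary down.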
403 We could drop the last octet of the IP address as soon as we see
404 it. That would cause us to undercount some users from cablemodem and
405 DSL networks that have a high density of Tor users. And it wouldn't
406 really help that much -- indeed, the extent to which it does help is
407 exactly the extent to which it makes our stats less useful.
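The undercounting is easy to see in a sketch: once the last octet is zeroed, two users in the same /24 become indistinguishable. The function name and the addresses are illustrative.

```python
def drop_last_octet(address):
    """Zero the final octet of a dotted-quad IPv4 address."""
    a, b, c, _ = address.split(".")
    return f"{a}.{b}.{c}.0"

# Two distinct users on the same DSL network collapse into one entry.
users = ["198.51.100.17", "198.51.100.200", "203.0.113.5"]
distinct = {drop_last_octet(u) for u in users}
print(len(users), len(distinct))
```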