   1 Filename: 126-geoip-reporting.txt
   2 Title: Getting GeoIP data and publishing usage summaries
   3 Author: Roger Dingledine
   4 Created: 2007-11-24
   5 Status: Closed
   6 Implemented-In: 0.2.0.x
   7
   8 0. Status
   9
  10   In 0.2.0.x, this proposal is implemented to the extent needed to
  11   address its motivations.  See notes below with the text "RESOLUTION"
  12   for details.
  13
  14 1. Background and motivation
  15
  16   Right now we can keep a rough count of Tor users, both total and by
  17   country, by watching connections to a single directory mirror. Being
  18   able to get usage estimates is useful both for our funders (to
  19   demonstrate progress) and for our own development (so we know how
  20   quickly we're scaling and can design accordingly, and so we know which
  21   countries and communities to focus on more). This need for information
  22   is the only reason we haven't deployed "directory guards" (think of
  23   them like entry guards but for directory information; in practice,
  24   it would seem that Tor clients should simply use their entry guards
  25   as their directory guards; see also proposal 125).
  26
  27   With the move toward bridges, we will no longer be able to track Tor
  28   clients that use bridges, since they use their bridges as directory
  29   guards. Further, we need to be able to learn which bridges stop seeing
  30   use from certain countries (and are thus likely blocked), so we can
  31   avoid giving them out to other users in those countries.
  32
  33   Right now we already do GeoIP lookups in Vidalia: Vidalia draws relays
  34   and circuits on its 'network map', and it performs anonymized GeoIP
  35   lookups to its central servers to know where to put the dots. Vidalia
  36   caches answers it gets -- to reduce delay, to reduce overhead on
  37   the network, and to reduce anonymity issues where users reveal their
  38   knowledge about the network through which IP addresses they ask about.
  39
  40   But with the advent of bridges, Tor clients are asking about IP
  41   addresses that aren't in the main directory. In particular, bridge
  42   users inform the central Vidalia servers about each bridge as they
  43   discover it and their Vidalia tries to map it.
  44
  45   Also, we wouldn't mind letting Vidalia do a GeoIP lookup on the client's
  46   own IP address, so it can provide a more useful map.
  47
  48   Finally, Vidalia's central servers leave users open to partitioning
  49   attacks, even if they can't target specific users. Further, as we
  50   start using GeoIP results for more operational or security-relevant
  51   goals, such as avoiding or including particular countries in circuits,
  52   it becomes more important that users can't be singled out in terms of
  53   their IP-to-country mapping beliefs.
  54
  55 2. The available GeoIP databases
  56
  57   There are at least two classes of GeoIP database out there: "IP to
  58   country", which tells us the country code for the IP address but
  59   no more details, and "IP to city", which tells us the country code,
  60   the name of the city, and some basic latitude/longitude guesses.
  61
  62   A recent ip-to-country.csv is 3421362 bytes. Compressed, it is 564252
  63   bytes. A typical line is:
  64     "205500992","208605279","US","USA","UNITED STATES"
  65   http://ip-to-country.webhosting.info/node/view/5
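
       As a rough illustration of what a local, private lookup against
       this kind of file involves, here is a minimal Python sketch. It
       assumes the first two fields are the inclusive start and end of
       an IPv4 range written as 32-bit integers, as the sample line
       suggests. The names load_ranges and lookup are made up for this
       example; they are not Tor or Vidalia functions.

```python
import bisect
import csv
import io
import ipaddress

# One row in the ip-to-country.csv layout quoted above. Assumption: the
# first two fields are the inclusive start and end of an IPv4 range.
SAMPLE = '"205500992","208605279","US","USA","UNITED STATES"\n'


def load_ranges(text):
    """Parse CSV text into a sorted list of (start, end, country_code)."""
    rows = csv.reader(io.StringIO(text))
    return sorted((int(r[0]), int(r[1]), r[2]) for r in rows if r)


def lookup(ranges, addr):
    """Return the two-letter code for a dotted-quad address, or None."""
    n = int(ipaddress.IPv4Address(addr))
    starts = [r[0] for r in ranges]
    # Find the last range whose start is <= n, then check its end.
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0 and ranges[i][0] <= n <= ranges[i][1]:
        return ranges[i][2]
    return None


ranges = load_ranges(SAMPLE)
in_range = lookup(ranges, "12.64.0.1")   # falls inside the sample range
missing = lookup(ranges, "8.8.8.8")      # not covered by the sample row
```

       A real implementation would build the list of range starts once
       rather than on every lookup; it is rebuilt here only to keep
       the sketch short.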
  66
  67   Similarly, the maxmind GeoLite Country database is also about 500KB
  68   compressed.
  69   http://www.maxmind.com/app/geolitecountry
  70
  71   The maxmind GeoLite City database gives more fine-grained detail like
  72   geo coordinates and city name. Vidalia currently makes use of this
  73   information. On the other hand it's 16MB compressed. A typical line is:
  74     206.124.149.146,Bellevue,WA,US,47.6051,-122.1134
  75   http://www.maxmind.com/app/geolitecity
  76
  77   There are other databases out there, like
  78   http://www.hostip.info/faq.html
  79   http://www.webconfs.com/ip-to-city.php
  80   that want more attention, but for now let's assume that all the db's
  81   are around this size.
  82
  83 3. What we'd like to solve
  84
  85   Goal #1a: Tor relays collect IP-to-country user stats and publish
  86   sanitized versions.
  87   Goal #1b: Tor bridges collect IP-to-country user stats and publish
  88   sanitized versions.
  89
  90   Goal #2a: Vidalia learns IP-to-city stats for Tor relays, for better
  91   mapping.
  92   Goal #2b: Vidalia learns IP-to-country stats for Tor relays, so the user
  93   can pick countries for her paths.
  94
  95   Goal #3: Vidalia doesn't do external lookups on bridge relay addresses.
  96
  97   Goal #4: Vidalia resolves the Tor client's IP-to-country or IP-to-city
  98   for better mapping.
  99
 100   Goal #5: Reduce partitioning opportunities where Vidalia central
 101   servers can give different (distinguishing) responses.
 102
 103 4. Solution overview
 104
 105   Our goal is to allow Tor relays, bridges, and clients to learn enough
 106   GeoIP information so they can do local private queries.
 107
 108 4.1. The IP-to-country db
 109
 110   Directory authorities should publish a "geoip" file that contains
 111   IP-to-country mappings. Directory caches will mirror it, and Tor clients
 112   and relays (including bridge relays) will fetch it. Thus we can solve
 113   goals 1a and 1b (publish sanitized usage info). Controllers could also
 114   use this to solve goal 2b (choosing path by country attributes). It
 115   also solves goal 4 (learning the Tor client's country), though for
 116   huge countries like the US we'd still need to decide where the "middle"
 117   should be when we're mapping that address.
 118
 119   The IP-to-country details are described further in Sections 5 and
 120   6 below.
 121
 122   [RESOLUTION: The geoip file in 0.2.0.x is not distributed through
 123   Tor.  Instead, it is shipped with the bundle.]
 124
 125 4.2. The IP-to-city db
 126
 127   In an ideal world, the IP-to-city db would be small enough that we
 128   could distribute it in the above manner too. But for now, it is too
 129   large. Here's where the design choice forks.
 130
 131   Option A: Vidalia should continue doing its anonymized IP-to-city
 132   queries. Thus we can achieve goals 2a and 2b. We would solve goal
 133   3 by only doing lookups on descriptors that are purpose "general"
 134   (see Section 4.2.1 for how). We would leave goal 5 unsolved.
 135
 136   Option B: Each directory authority should keep an IP-to-city db,
 137   look up the value for each router it lists, and include that line in
 138   the router's network-status entry. The network-status consensus would
 139   then use the line that appears in the majority of votes. This approach
 140   also solves goals 2a and 2b, goal 3 (Vidalia doesn't do any lookups
 141   at all now), and goal 5 (reduced partitioning risks).
 142
 143   Option B has the advantage that Vidalia can simplify its operation,
 144   and the advantage that this consensus IP-to-city data is available to
 145   other controllers besides just Vidalia. But it has the disadvantage
 146   that the networkstatus consensus becomes larger, even though most of
 147   the GeoIP information won't change from one consensus to the next. Is
 148   there another reasonable location for it that can provide similar
 149   consensus security properties?
 150
 151   [RESOLUTION: IP-to-city is not supported.]
 152
 153 4.2.1. Controllers can query for router annotations
 154
 155   Vidalia needs to stop doing queries on bridge relay IP addresses.
 156   It could do that by only doing lookups on descriptors that are in
 157   the networkstatus consensus, but that precludes designs like Blossom
 158   that might want to map its relay locations. The best answer is that it
 159   should learn the router annotations, with a new controller 'getinfo'
 160   command:
 161     "GETINFO desc-annotations/id/<OR identity>"
 162   which would respond with something like
 163     @downloaded-at 2007-11-29 08:06:38
 164     @source "128.31.0.34"
 165     @purpose bridge
 166
 167   [We could also make the answer include the digest for the router in
 168   question, which would enable us to ask GETINFO desc-annotations/all.
 169   Is this worth it? -RD]
 170
 171   Then Vidalia can avoid doing lookups on descriptors with purpose
 172   "bridge". Even better would be to add a new annotation "@private true"
 173   so Vidalia can know how to handle new purposes that we haven't created
 174   yet. Vidalia could special-case "bridge" for now, for compatibility
 175   with the current 0.2.0.x-alphas.
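
       On the controller side, the decision could look roughly like
       the Python sketch below. It assumes the reply body is a list of
       "@key value" lines as in the example above, and that a missing
       @purpose means "general"; both are assumptions, and the helper
       names are hypothetical rather than part of Vidalia.

```python
def parse_annotations(reply):
    """Turn '@key value' lines into a dict of annotation -> value."""
    notes = {}
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("@"):
            key, _, value = line[1:].partition(" ")
            notes[key] = value
    return notes


def may_lookup_externally(notes):
    """Decide whether an external GeoIP lookup is acceptable.

    "@private true" is the annotation proposed in the text; "bridge" is
    special-cased for compatibility. Treating a missing @purpose as
    "general" is an assumption of this sketch.
    """
    if notes.get("private") == "true":
        return False
    return notes.get("purpose", "general") != "bridge"


EXAMPLE_REPLY = """@downloaded-at 2007-11-29 08:06:38
@source "128.31.0.34"
@purpose bridge
"""
bridge_ok = may_lookup_externally(parse_annotations(EXAMPLE_REPLY))
```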
 176
 177 4.3. Recommendation
 178
 179   My overall recommendation is that we should implement 4.1 soon
 180   (e.g. early in 0.2.1.x), and we can go with 4.2 option A for now,
 181   with the hope that later we discover a better way to distribute the
 182   IP-to-city info and can switch to 4.2 option B.
 183
 184   Below we discuss more how to go about achieving 4.1.
 185
 186 5. Publishing and caching the GeoIP (IP-to-country) database
 187
 188   Each v3 directory authority should put a copy of the "geoip" file in
 189   its datadirectory. Then its network-status votes should include a hash
 190   of this file (Recommended-geoip-hash: %s), and the resulting consensus
 191   directory should specify the consensus hash.
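
       A minimal Python sketch of the vote line and of picking the
       consensus value follows. The text above does not say which hash
       function to use or how the consensus hash is chosen, so SHA-1
       in hex and "the value listed by more than half of the votes"
       are assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def geoip_hash(data):
    """Hex digest of the geoip file contents (SHA-1 is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def vote_line(data):
    """Format the line an authority would put in its vote."""
    return "Recommended-geoip-hash: %s" % geoip_hash(data)


def consensus_hash(voted_hashes):
    """Return the hash listed by a strict majority of votes, or None.

    The strict-majority rule is an assumption of this sketch.
    """
    if not voted_hashes:
        return None
    value, count = Counter(voted_hashes).most_common(1)[0]
    return value if count * 2 > len(voted_hashes) else None


old = geoip_hash(b"old geoip file\n")
new = geoip_hash(b"new geoip file\n")
chosen = consensus_hash([new, new, old])  # two of three votes agree
```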
 192
 193   There should be a new URL for fetching this geoip db (by "current.z"
 194   for testing purposes, and by hash.z for typical downloads). Authorities
 195   should fetch and serve the one listed in the consensus, even when they
 196   vote for their own. This would argue for storing the cached version
 197   in a better filename than "geoip".
 198
 199   Directory mirrors should keep a copy of this file available via the
 200   same URLs.
 201
 202   We assume that the file would change at most a few times a month. Should
 203   Tor ship with a bootstrap geoip file? An out-of-date geoip file may
 204   open you up to partitioning attacks, but for the most part it won't
 205   be that different.
 206
 207   There should be a config option to disable updating the geoip file,
 208   in case users want to use their own file (e.g. they have a proprietary
 209   GeoIP file they prefer to use). In that case we leave it up to the
 210   user to update his geoip file out-of-band.
 211
 212   [XXX Should consider forward/backward compatibility, e.g. if we want
 213   to move to a new geoip file format. -RD]
 214
 215   [RESOLUTION: Not done over Tor.]
 216
 217 6. Controllers use the IP-to-country db for mapping and for path building
 218
 219   Down the road, Vidalia could use the IP-to-country mappings for placing
 220   on its map:
 221   - The location of the client
 222   - The location of the bridges, or other relays not in the
 223     networkstatus, on the map.
 224   - Any relays that it doesn't yet have an IP-to-city answer for.
 225
 226   Other controllers can also use it to set EntryNodes, ExitNodes, etc.
 227   in a per-country way.
 228
 229   To support these features, we need to export the IP-to-country data
 230   via the Tor controller protocol.
 231
 232   Is it sufficient just to add a new GETINFO command?
 233     GETINFO ip-to-country/128.31.0.34
 234     250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
 235
 236   [RESOLUTION: Not done now, except for the getinfo command.]
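
       For a controller author, parsing a reply in the format proposed
       above might look like this Python sketch. It handles only the
       exact line shape shown in the example; the function name is
       hypothetical, and the format Tor actually returns may differ
       from the one sketched in this proposal.

```python
import csv


def parse_ip_to_country(line):
    """Split a proposed-format reply line into (addr, code, abbrev, name)."""
    prefix = "ip-to-country/"
    start = line.index(prefix) + len(prefix)
    addr, _, fields = line[start:].partition("=")
    # The value is a comma-separated list of double-quoted strings.
    code, abbrev, name = next(csv.reader([fields]))
    return addr, code, abbrev, name


parsed = parse_ip_to_country(
    '250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"')
```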
 237
 238 6.1. Other interfaces
 239
 240   Robert Hogan has also suggested a
 241
 242     GETINFO relays-by-country/cn
 243
 244   as well as torrc options for ExitCountryCodes, EntryCountryCodes,
 245   ExcludeCountryCodes, etc.
 246
 247   [RESOLUTION: Not implemented in 0.2.0.x.  Fodder for a future proposal.]
 248
 249 7. Relays and bridges use the IP-to-country db for usage summaries
 250
 251   Once bridges have a GeoIP database locally, they can start to publish
 252   sanitized summaries of client usage -- how many users they see and from
 253   what countries. This might also be a more useful way for ordinary Tor
 254   relays to convey the level of usage they see, which would allow us to
 255   switch to using directory guards for all users by default.
 256
 257   But how to safely summarize this information without opening too many
 258   anonymity leaks?
 259
 260 7.1. Attacks to think about
 261
 262   First, note that we need to have a large enough time window that we're
 263   not aiding correlation attacks much. I hope 24 hours is enough. So
 264   that means no publishing stats until you've been up at least 24 hours.
 265   And you can't publish follow-up stats more often than every 24 hours,
 266   or people could look at the differential.
 267
 268   Second, note that we need to be sufficiently vague about the IP
 269   addresses we're reporting. We are hoping that just specifying the
 270   country will be vague enough. But a) what about active attacks where
 271   we convince a bridge to use a GeoIP db that labels each suspect IP
 272   address as a unique country? We have to assume that the consensus GeoIP
 273   db won't be malicious in this way. And b) could such singling-out
 274   attacks occur naturally, for example because of countries that have
 275   a very small IP space? We should investigate that.
 276
 277 7.2. Granularity of users
 278
 279   Do we only want to report countries that have a sufficient anonymity set
 280   (that is, number of users) for the day? For example, we might avoid
 281   listing any countries that have seen fewer than five addresses over
 282   the 24 hour period. This approach would be helpful in reducing the
 283   singling-out opportunities -- in the extreme case, we could imagine a
 284   situation where one blogger from the Sudan used Tor on a given day, and
 285   we can discover which entry guard she used.
 286
 287   But I fear that especially for bridges, seeing only one hit from a
 288   given country in a given day may be quite common.
 289
 290   As a compromise, we should start out with an "Other" category in
 291   the reported stats, which is the sum of unlisted countries; if that
 292   category is consistently interesting, we can think harder about how
 293   to get the right data from it safely.
 294
 295   But note that bridge summaries will not be made public individually,
 296   since doing so would help people enumerate bridges. Whereas summaries
 297   from normal relays will be public. So perhaps that means we can afford
 298   to be more specific in bridge summaries? In particular, I'm thinking the
 299   "Other" category should be used by public relays but not for bridges
 300   (or if it is, used with a lower threshold).
 301
 302   Even for countries that have many Tor users, we might not want to be
 303   too specific about how many users we've seen. For example, we might
 304   round down the number of users we report to the nearest multiple of 5.
 305   My instinct for now is that this won't be that useful.
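
       For concreteness, rounding down to a multiple of 5 is a one-line
       operation. The Python helper below is only an illustration; its
       name is made up.

```python
def round_down(count, step=5):
    """Round a user count down to the nearest multiple of step."""
    return count - (count % step)


reported = [round_down(n) for n in (4, 5, 13, 27)]
```

       Note that any country with fewer than 5 users would be reported
       as 0 under this rule.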
 306
 307 7.3. Other issues
 308
 309   Another note: we'll likely be overreporting in the case of users with
 310   dynamic IP addresses: if they rotate to a new address over the course
 311   of the day, we'll count them twice. So be it.
 312
 313 7.4. Where to publish the summaries?
 314
 315   We designed extrainfo documents for information like this. So they
 316   should just be more entries in the extrainfo doc.
 317
 318   But if we want to publish summaries every 24 hours (no more often,
 319   no less often), aren't we tied to the router descriptor publishing
 320   schedule? That is, if we publish a new router descriptor at the 18
 321   hour mark, and nothing much has changed at the 24 hour mark, won't
 322   the new descriptor get dropped as being "cosmetically similar", and
 323   then nobody will know to ask about the new extrainfo document?
 324
 325   One solution would be to make and remember the 24 hour summary at the
 326   24 hour mark, but not actually publish it anywhere until we happen to
 327   publish a new descriptor for other reasons. If we happen to go down
 328   before publishing a new descriptor, then so be it, at least we tried.
 329
 330 7.5. What if the relay is unreachable or goes to sleep?
 331
 332   Even if you've been up for 24 hours, if you were hibernating for 18
 333   of them, then we're not getting as much fuzziness as we'd like. So
 334   I guess that means that we need a 24-hour period of being "awake"
 335   before we're willing to publish a summary. A similar attack works if
 336   you've been awake but unreachable for the first 18 of the 24 hours. As
 337   another example, a bridge that's on a laptop might be suspended for
 338   some of each day.
 339
 340   This implies that some relays and bridges will never publish summary
 341   stats, because they're not ever reliably working for 24 hours in
 342   a row. If a significant percentage of our reporters end up being in
 343   this boat, we should investigate whether we can accumulate 24 hours of
 344   "usefulness", even if there are holes in the middle, and publish based
 345   on that.
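
       One way to accumulate 24 hours of "usefulness" across holes is
       sketched below in Python. This is only an illustration of the
       idea, not a design decision: the class and method names are
       hypothetical, and "useful" is assumed to mean awake and
       reachable. Times are plain seconds supplied by the caller.

```python
DAY = 24 * 60 * 60


class UptimeAccumulator:
    """Sums the seconds spent awake and reachable, skipping the holes."""

    def __init__(self):
        self.useful_seconds = 0
        self._since = None  # start of the current useful interval, if any

    def mark_useful(self, now):
        """Call when the relay wakes up or becomes reachable."""
        if self._since is None:
            self._since = now

    def mark_unusable(self, now):
        """Call when the relay hibernates or becomes unreachable."""
        if self._since is not None:
            self.useful_seconds += now - self._since
            self._since = None

    def ready(self, now):
        """True once 24 hours of useful time have accumulated."""
        total = self.useful_seconds
        if self._since is not None:
            total += now - self._since
        return total >= DAY


HOUR = 60 * 60
acc = UptimeAccumulator()
acc.mark_useful(0)
acc.mark_unusable(6 * HOUR)      # 6 useful hours, then an 18-hour hole
acc.mark_useful(24 * HOUR)
too_early = acc.ready(41 * HOUR)  # 6 + 17 = 23 useful hours
enough = acc.ready(42 * HOUR)     # 6 + 18 = 24 useful hours
```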
 346
 347   What other issues are like this? It seems that just moving to a new
 348   IP address shouldn't be a reason to cancel stats publishing, assuming
 349   we were usable at each address.
 350
 351 7.6. IP addresses that aren't in the geoip db
 352
 353   Some IP addresses aren't in the public geoip databases. In particular,
 354   I've found that a lot of African countries are missing, but there
 355   are also some common ones in the US that are missing, like parts of
 356   Comcast. We could just lump unknown IP addresses into the "Other"
 357   category, but it might be useful to gather a general sense of how many
 358   lookups are failing entirely, by adding a separate "Unknown" category.
 359
 360   We could also contribute back to the geoip db, by letting bridges set
 361   a config option to report the actual IP addresses that failed their
 362   lookup. Then the bridge authority operators can manually make sure
 363   the correct answer will be in later geoip files. This config option
 364   should be disabled by default.
 365
 366 7.7. Bringing it all together
 367
 368   So here's the plan:
 369
 370   24 hours after starting up (modulo Section 7.5 above), bridges and
 371   relays should construct a daily summary of client countries they've
 372   seen, including the above "Unknown" category (Section 7.6) as well.
 373
 374   Non-bridge relays lump all countries with fewer than K (e.g. K=5) users
 375   into the "Other" category (see Sec 7.2 above), whereas bridge relays are
 376   willing to list a country even when it has only one user for the day.
 377
 378   Whenever we have a daily summary on record, we include it in our
 379   extrainfo document whenever we publish one. The daily summary we
 380   remember locally gets replaced with a newer one when another 24
 381   hours pass.
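
       The summary step of this plan could be sketched in Python as
       follows. The input is assumed to be one country code per
       distinct client address seen during the window, with None
       standing for a failed lookup. Keeping "Unknown" out of the
       "Other" bucket even when it is below K is an assumption of this
       sketch; the function name is hypothetical.

```python
from collections import Counter


def daily_summary(country_codes, is_bridge, k=5):
    """Build the per-country counts a relay or bridge would report.

    country_codes: one entry per distinct client address; None means the
    GeoIP lookup failed and is counted as "Unknown".
    """
    counts = Counter("Unknown" if c is None else c for c in country_codes)
    if is_bridge:
        # Bridges list a country even if it had a single user.
        return dict(counts)
    summary = {}
    other = 0
    for name, n in counts.items():
        if name != "Unknown" and n < k:
            other += n  # lump small countries together
        else:
            summary[name] = n
    if other:
        summary["Other"] = other
    return summary


seen = ["us"] * 6 + ["sd"] + ["de"] * 2 + [None]
relay_view = daily_summary(seen, is_bridge=False)
bridge_view = daily_summary(seen, is_bridge=True)
```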
 382
 383 7.8. Some forward secrecy
 384
 385   How should we remember addresses locally? If we convert them into
 386   country-codes immediately, we will count them again if we see them
 387   again. On the other hand, we don't really want to keep a list hanging
 388   around of all IP addresses we've seen in the past 24 hours.
 389
 390   Step one is that we should never write this stuff to disk. Keeping it
 391   only in RAM will make things somewhat better. Step two is to avoid
 392   keeping any timestamps associated with it: rather than a rolling
 393   24-hour window, which would require us to remember the various times
 394   we've seen that address, we can instead just throw out the whole list
 395   every 24 hours and start over.
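
       A minimal Python sketch of the "throw out the whole list"
       approach follows. The only timestamp kept is the start of the
       current window; nothing is stored per address. The class name
       is made up, and a real relay would build its daily summary from
       the set before clearing it.

```python
DAY = 24 * 60 * 60


class SeenAddresses:
    """In-memory set of addresses, discarded wholesale every 24 hours."""

    def __init__(self, now):
        self._seen = set()
        self._window_start = now  # the only timestamp we keep

    def note(self, addr, now):
        """Record addr; return True if it is new in the current window."""
        if now - self._window_start >= DAY:
            self._seen.clear()
            self._window_start = now
        if addr in self._seen:
            return False
        self._seen.add(addr)
        return True


s = SeenAddresses(now=0)
first = s.note("192.0.2.10", now=10)
repeat = s.note("192.0.2.10", now=20)
next_day = s.note("192.0.2.10", now=DAY + 1)
```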
 396
 397   We could hash the addresses, and then compare hashes when deciding if
 398   we've seen a given address before. We could even do keyed hashes. Or
 399   Bloom filters. But if our goal is to defend against an adversary
 400   who steals a copy of our RAM while we're running and then does
 401   guess-and-check on whatever blob we're keeping, we're in bad shape.
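
       To illustrate why keyed hashes do not help against that
       adversary, here is a small Python sketch. It assumes the key
       sits in RAM next to the set of tags, so a RAM snapshot yields
       both. HMAC-SHA256 is just one possible keyed hash, and the
       addresses come from a documentation range; all names are
       hypothetical.

```python
import hashlib
import hmac
import ipaddress
import os

key = os.urandom(32)  # lives only in RAM, next to the set below


def tag(addr, key):
    """Keyed hash of an address (HMAC-SHA256 chosen for illustration)."""
    return hmac.new(key, addr.encode("ascii"), hashlib.sha256).digest()


# What the relay keeps instead of raw addresses.
seen = {tag("203.0.113.77", key)}


def guess_and_check(seen, key, network):
    """What an attacker with a RAM snapshot (set and key) can do."""
    return [str(ip) for ip in ipaddress.ip_network(network)
            if tag(str(ip), key) in seen]


recovered = guess_and_check(seen, key, "203.0.113.0/24")
```

       Testing a whole /24 takes 256 hash computations; the IPv4
       space is small enough that the same loop over every suspect
       network is cheap.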
 402
 403   We could drop the last octet of the IP address as soon as we see
 404   it. That would cause us to undercount some users from cablemodem and
 405   DSL networks that have a high density of Tor users. And it wouldn't
 406   really help that much -- indeed, the extent to which it does help is
 407   exactly the extent to which it makes our stats less useful.
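
       The undercounting effect is easy to see in a toy Python example.
       The addresses are from documentation ranges and the helper name
       is made up.

```python
def drop_last_octet(addr):
    """Zero the final octet of a dotted-quad IPv4 address."""
    a, b, c, _ = addr.split(".")
    return "%s.%s.%s.0" % (a, b, c)


clients = ["198.51.100.7", "198.51.100.8", "198.51.100.200", "192.0.2.1"]
exact = len(set(clients))
truncated = len(set(drop_last_octet(a) for a in clients))
```

       Three distinct clients in the same /24 collapse into a single
       entry, so four users are counted as two.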
 408
 409   Other ideas?
 410