doc/spec/proposals/151-path-selection-improvements.txt

   1 Filename: 151-path-selection-improvements.txt
   2 Title: Improving Tor Path Selection
   3 Version:
   4 Last-Modified:
   5 Author: Fallon Chen, Mike Perry
   6 Created: 5-Jul-2008
   7 Status: Draft
   8
   9 Overview
  10
  11   The performance of paths selected can be improved by adjusting the
  12   CircuitBuildTimeout and avoiding failing guard nodes. This proposal
  13   describes a method of tracking buildtime statistics at the client, and
  14   using those statistics to adjust the CircuitBuildTimeout.
  15
  16 Motivation
  17
  18   Tor's performance can be improved by excluding those circuits that
  19   have long buildtimes (and by extension, high latency). For those Tor
  20   users who require better performance and have lower requirements for
  21   anonymity, this would be a very useful option to have.
  22
  23 Implementation
  24
  25   Storing Build Times
  26
  27     Circuit build times will be stored in the circular array
  28     'circuit_build_times' consisting of uint16_t elements as milliseconds.
  29     The total size of this array will be based on the number of circuits
  30     it takes to converge on a good fit of the long term distribution of
  31     the circuit builds for a fixed link. We do not want this value to be
  32     too large, because it will make it difficult for clients to adapt to
  33     moving between different links.
  34
  35     From our initial observations, this value appears to be on the order
  36     of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
  37     The exact value for this #define will be determined by performing
  38     goodness of fit tests using measurments obtained from the shufflebt.py
  39     script from TorFlow.
  40
  41   Long Term Storage
  42
  43     The long-term storage representation will be implemented by storing a
  44     histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when
  45     writing out the statistics to disk. The format of this histogram on disk
  46     is yet to be finalized, but it will likely be of the format
  47     'CircuitBuildTime <bin> <count>', with the total specified as
  48     'TotalBuildTimes <total>'
  49     Example:
  50
  51     TotalBuildTimes 100
  52     CircuitBuildTimeBin 1 50
  53     CircuitBuildTimeBin 2 25
  54     CircuitBuildTimeBin 3 13
  55     ...
  56
  57     Reading the histogram in will entail multiplying each bin by the
  58     BUILDTIME_BIN_WIDTH and then inserting <count> values into the
  59     circuit_build_times array each with the value of
  60     <bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the
  61     values in the circular array, a form of index skipping must
  62     be employed. Values from bin #N with bin count C and total T
  63     will occupy indexes specified by N+((T/C)*k)-1, where k is the
  64     set of integers ranging from 0 to C-1.
  65
  66     For example, this would mean that the values from bin 1 would
  67     occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
  68     The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
  69     will be inserted at the first empty position in the array greater
  70     than the selected index (which may requiring looping around the
  71     array back to index 0).
  72
  73   Learning the CircuitBuildTimeout
  74
  75     Based on studies of build times, we found that the distribution of
  76     circuit buildtimes appears to be a Pareto distribution.
  77
  78     We will calculate the parameters for a Pareto distribution
  79     fitting the data using the estimators at
  80     http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
  81
  82     The timeout itself will be calculated by solving the CDF for the
  83     a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
  84     represents the percentage of paths the Tor client will accept out of
  85     the total number of paths. We have not yet determined a good
  86     cutoff for this mathematically, but 85% seems a good choice for now.
  87
  88     From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
  89     the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm.
  90
  91   Testing
  92
  93     After circuit build times, storage, and learning are implemented,
  94     the resulting histogram should be checked for consistency by
  95     verifying it persists across successive Tor invocations where
  96     no circuits are built. In addition, we can also use the existing
  97     buildtime scripts to record build times, and verify that the histogram
  98     the python produces matches that which is output to the state file in Tor,
  99     and verify that the Pareto parameters and cutoff points also match.
 100
 101   Soft timeout vs Hard Timeout
 102
 103     At some point, it may be desirable to change the cutoff from a
 104     single hard cutoff that destroys the circuit to a soft cutoff and
 105     a hard cutoff, where the soft cutoff merely triggers the building
 106     of a new circuit, and the hard cutoff triggers destruction of the
 107     circuit.
 108
 109     Good values for hard and soft cutoffs seem to be 85% and 65%
 110     respectively, but we should eventually justify this with observation.
 111
 112   When to Begin Calculation
 113
 114     The number of circuits to observe (NCIRCUITS_TO_CUTOFF) before
 115     changing the CircuitBuildTimeout will be tunable via a #define. From
 116     our measurements, a good value for NCIRCUITS_TO_CUTOFF appears to be
 117     on the order of 100.
 118
 119   Dealing with Timeouts
 120
 121     Timeouts should be counted as the expectation of the region of
 122     of the Pareto distribution beyond the cutoff. The proposal will
 123     be updated with this value soon.
 124
 125     Also, in the event of network failure, the observation mechanism
 126     should stop collecting timeout data.
 127
 128   Client Hints
 129
 130     Some research still needs to be done to provide initial values
 131     for CircuitBuildTimeout based on values learned from modem
 132     users, DSL users, Cable Modem users, and dedicated links. A
 133     radiobutton in Vidalia should eventually be provided that
 134     sets CircuitBuildTimeout to one of these values and also
 135     provide the option of purging all learned data, should any exist.
 136
 137     These values can either be published in the directory, or
 138     shipped hardcoded for a particular Tor version.
 139
 140 Issues
 141
 142   Impact on anonymity
 143
 144     Since this follows a Pareto distribution, large reductions on the
 145     timeout can be achieved without cutting off a great number of the
 146     total paths. This will eliminate a great deal of the performance
 147     variation of Tor usage.