1 Filename: 151-path-selection-improvements.txt
2 Title: Improving Tor Path Selection
5 Author: Fallon Chen, Mike Perry
11 The performance of paths selected can be improved by adjusting the
12 CircuitBuildTimeout and avoiding failing guard nodes. This proposal
13 describes a method of tracking buildtime statistics at the client, and
14 using those statistics to adjust the CircuitBuildTimeout.
18 Tor's performance can be improved by excluding those circuits that
19 have long buildtimes (and by extension, high latency). For those Tor
20 users who require better performance and have lower requirements for
21 anonymity, this would be a very useful option to have.
27 Circuit build times will be stored in the circular array
28 'circuit_build_times' consisting of uint16_t elements as milliseconds.
29 The total size of this array will be based on the number of circuits
30 it takes to converge on a good fit of the long term distribution of
31 the circuit builds for a fixed link. We do not want this value to be
32 too large, because it will make it difficult for clients to adapt to
33 moving between different links.
35 From our initial observations, this value appears to be on the order
36 of 1000, but will be configurable in a #define NCIRCUITS_TO_OBSERVE.
37 The exact value for this #define will be determined by performing
38 goodness of fit tests using measurments obtained from the shufflebt.py
43 The long-term storage representation will be implemented by storing a
44 histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when
45 writing out the statistics to disk. The format of this histogram on disk
46 is yet to be finalized, but it will likely be of the format
47 'CircuitBuildTime <bin> <count>', with the total specified as
48 'TotalBuildTimes <total>'
52 CircuitBuildTimeBin 1 50
53 CircuitBuildTimeBin 2 25
54 CircuitBuildTimeBin 3 13
57 Reading the histogram in will entail multiplying each bin by the
58 BUILDTIME_BIN_WIDTH and then inserting <count> values into the
59 circuit_build_times array each with the value of
60 <bin>*BUILDTIME_BIN_WIDTH. In order to evenly distribute the
61 values in the circular array, a form of index skipping must
62 be employed. Values from bin #N with bin count C and total T
63 will occupy indexes specified by N+((T/C)*k)-1, where k is the
64 set of integers ranging from 0 to C-1.
66 For example, this would mean that the values from bin 1 would
67 occupy indexes 1+(100/50)*k-1, or 0, 2, 4, 6, 8, 10 and so on.
68 The values for bin 2 would occupy positions 1, 5, 9, 13. Collisions
69 will be inserted at the first empty position in the array greater
70 than the selected index (which may requiring looping around the
71 array back to index 0).
73 Learning the CircuitBuildTimeout
75 Based on studies of build times, we found that the distribution of
76 circuit buildtimes appears to be a Pareto distribution.
78 We will calculate the parameters for a Pareto distribution
79 fitting the data using the estimators at
80 http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
82 The timeout itself will be calculated by solving the CDF for the
83 a percentile cutoff BUILDTIME_PERCENT_CUTOFF. This value
84 represents the percentage of paths the Tor client will accept out of
85 the total number of paths. We have not yet determined a good
86 cutoff for this mathematically, but 85% seems a good choice for now.
88 From http://en.wikipedia.org/wiki/Pareto_distribution#Definition,
89 the calculation we need is pow(BUILDTIME_PERCENT_CUTOFF/100.0, k)/Xm.
93 After circuit build times, storage, and learning are implemented,
94 the resulting histogram should be checked for consistency by
95 verifying it persists across successive Tor invocations where
96 no circuits are built. In addition, we can also use the existing
97 buildtime scripts to record build times, and verify that the histogram
98 the python produces matches that which is output to the state file in Tor,
99 and verify that the Pareto parameters and cutoff points also match.
101 Soft timeout vs Hard Timeout
103 At some point, it may be desirable to change the cutoff from a
104 single hard cutoff that destroys the circuit to a soft cutoff and
105 a hard cutoff, where the soft cutoff merely triggers the building
106 of a new circuit, and the hard cutoff triggers destruction of the
109 Good values for hard and soft cutoffs seem to be 85% and 65%
110 respectively, but we should eventually justify this with observation.
112 When to Begin Calculation
114 The number of circuits to observe (NCIRCUITS_TO_CUTOFF) before
115 changing the CircuitBuildTimeout will be tunable via a #define. From
116 our measurements, a good value for NCIRCUITS_TO_CUTOFF appears to be
119 Dealing with Timeouts
121 Timeouts should be counted as the expectation of the region of
122 of the Pareto distribution beyond the cutoff. The proposal will
123 be updated with this value soon.
125 Also, in the event of network failure, the observation mechanism
126 should stop collecting timeout data.
130 Some research still needs to be done to provide initial values
131 for CircuitBuildTimeout based on values learned from modem
132 users, DSL users, Cable Modem users, and dedicated links. A
133 radiobutton in Vidalia should eventually be provided that
134 sets CircuitBuildTimeout to one of these values and also
135 provide the option of purging all learned data, should any exist.
137 These values can either be published in the directory, or
138 shipped hardcoded for a particular Tor version.
144 Since this follows a Pareto distribution, large reductions on the
145 timeout can be achieved without cutting off a great number of the
146 total paths. This will eliminate a great deal of the performance
147 variation of Tor usage.