   1 Filename: 108-mtbf-based-stability.txt
   2 Title: Base "Stable" Flag on Mean Time Between Failures
   3 Author: Nick Mathewson
   4 Created: 10-Mar-2007
   5 Status: Closed
   6 Implemented-In: 0.2.0.x
   7
   8 Overview:
   9
  10    This document proposes that directory authorities stop setting the
  11    Stable flag from a router's self-declared uptime and instead base it on
  12    the mean time between failures (MTBF) that the authorities observe.
  13
  14 Motivation:
  15
  16    Clients prefer nodes that the authorities call Stable.  This flag is (as
  17    of 0.2.0.0-alpha-dev) set entirely based on the node's declared value for
  18    uptime.  This creates an opportunity for malicious nodes to declare
  19    falsely high uptimes in order to get more traffic.
  20
  21 Spec changes:
  22
  23    Replace the current rule for setting the Stable flag with:
  24
  25    "Stable" -- A router is 'Stable' if it is active and its observed Stability
  26    for the past month is at or above the median Stability for active routers.
  27    Routers are never called stable if they are running a version of Tor
  28    known to drop circuits stupidly. (0.1.1.10-alpha through 0.1.1.16-rc
  29    are stupid this way.)
  30
  31    Stability shall be defined as the weighted mean length of the runs
  32    observed by a given directory authority.  A run begins when an authority
  33    decides that the server is Running, and ends when the authority decides
  34    that the server is not Running.  In-progress runs are counted when
  35    measuring Stability.  When calculating the mean, runs are weighted by
  36    $\alpha ^ t$, where $t$ is time elapsed since the end of the run, and
  37    $0 < \alpha < 1$.  Time when an authority is down does not count toward
  38    the length of the run.
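
        As a rough illustration of the definition above, the following Python
        sketch computes the weighted mean run length for one router.  The Run
        record, its authority_downtime field, and the use of days as the time
        unit are assumptions made for this example; the proposal does not
        prescribe any of them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    """One observed run (hypothetical record layout, times in days)."""
    start: float                     # when the authority decided the router was Running
    end: Optional[float]             # when it decided otherwise; None if still in progress
    authority_downtime: float = 0.0  # time within the run when the authority was down


def stability(runs: Sequence[Run], now: float, alpha: float) -> float:
    """Weighted mean run length, each run weighted by alpha ** t."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    weighted_length = 0.0
    total_weight = 0.0
    for run in runs:
        end = now if run.end is None else run.end
        # Authority downtime does not count toward the run length.
        length = (end - run.start) - run.authority_downtime
        # t is the time elapsed since the run ended; an in-progress
        # run has t = 0 and therefore weight 1.
        weight = alpha ** (now - end)
        weighted_length += weight * length
        total_weight += weight
    return weighted_length / total_weight if total_weight > 0 else 0.0
```

        Under these assumptions, an old run contributes less and less to the
        mean as time passes, while the current run always has full weight.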
  39
  40 Rejected Alternative:
  41
  42    "A router's Stability shall be defined as the sum of $\alpha ^ d$ for every
  43    $d$ such that the router was considered reachable for the entire day
  44    $d$ days ago."
  45
  46    This allows a simpler implementation: every day, we multiply
  47    yesterday's Stability by alpha, and if the router was observed to be
  48    available every time we looked today, we add 1.
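
        The daily update described above can be sketched as follows.  The
        function names are hypothetical, and the second function exists only
        to show that the incremental rule matches the sum in the definition.

```python
from typing import Sequence


def update_daily_stability(previous: float, up_all_day: bool, alpha: float) -> float:
    """Multiply yesterday's Stability by alpha, then add 1 for a fully reachable day."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return alpha * previous + (1.0 if up_all_day else 0.0)


def stability_from_history(history: Sequence[bool], alpha: float) -> float:
    """Direct sum of alpha ** d over the days d (0 = today) that were fully reachable.

    history is ordered oldest day first, so the last entry is today.
    """
    newest_first = reversed(history)
    return sum(alpha ** d for d, up in enumerate(newest_first) if up)
```

        For example, feeding the days [up, up, down] through the update rule
        one at a time gives the same value as evaluating the sum directly.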
  49
  50    Instead of "day", we could pick an arbitrary time unit.  We should
  51    pick alpha to be high enough that long-term stability counts, but low
  52    enough that the distant past is eventually forgotten.  Something
  53    between 0.8 and 0.95 seems right.
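
        To make the trade-off concrete, this small sketch computes two
        quantities implied by the sum above: how many days it takes for one
        day's contribution to fall to half, and the largest Stability a
        router that is always up can reach (the geometric series limit).
        The function names are made up for the example.

```python
import math


def half_life_days(alpha: float) -> float:
    """Days d at which alpha ** d equals 0.5."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return math.log(0.5) / math.log(alpha)


def max_stability(alpha: float) -> float:
    """Limit of 1 + alpha + alpha**2 + ... for a router that is never down."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return 1.0 / (1.0 - alpha)
```

        With alpha = 0.8, a day's contribution halves after roughly 3.1 days
        and Stability cannot exceed 5; with alpha = 0.95, the figures are
        roughly 13.5 days and 20.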
  54
  55    (By requiring that routers be up for an entire day to get their
  56    stability increased, instead of counting fractions of a day, we
  57    capture the notion that stability is more like "probability of
  58    staying up for the next hour" than it is like "probability of being
  59    up at some randomly chosen time over the next hour."  The former
  60    notion of stability is far more relevant for long-lived circuits.)
  61
  62 Limitations:
  63
  64    Authorities can have false positives and false negatives when trying to
  65    tell whether a router is up or down.  So long as these errors are small
  66    and not significantly biased, we should be able to use the authorities'
  67    observations to estimate stability reasonably well.
  68
  69    Probing approaches like the above could miss short incidents of
  70    downtime.  If we used the router's declared uptime, we could detect
  71    these, but doing so would penalize routers that report their uptime
  72    accurately.
  73
  74 Implementation:
  75
  76    For now, the easiest way to store this information at authorities
  77    would probably be in some kind of periodically flushed flat file.
  78    Later, we could move to Berkeley DB or something similar if we had to.
  79
  80    For each router, an authority will need to store:
  81      The router ID.
  82      Whether the router is up.
  83      The time when the current run started, if the router is up.
  84      The weighted sum of the lengths of all previous runs.
  85      The time at which that weighted sum was last discounted.
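
        A minimal sketch of such a per-router record, holding only the fields
        listed above, might look like this.  The class name, method names,
        string router ID, and day-based time unit are all assumptions for
        illustration, not the authorities' actual storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouterHistory:
    """Hypothetical per-router record; times are in days."""
    router_id: str
    is_up: bool = False
    run_start: Optional[float] = None   # set only while the router is up
    weighted_run_length: float = 0.0    # weighted sum of previous run lengths
    last_discounted: float = 0.0        # when the sum was last weighted down

    def discount(self, now: float, alpha: float) -> None:
        """Age the stored sum by alpha ** (time since the last discount)."""
        elapsed = now - self.last_discounted
        if elapsed > 0:
            self.weighted_run_length *= alpha ** elapsed
            self.last_discounted = now

    def mark_up(self, now: float) -> None:
        """Start a new run if the router was previously down."""
        if not self.is_up:
            self.is_up = True
            self.run_start = now

    def mark_down(self, now: float, alpha: float) -> None:
        """Close the current run and fold its length into the sum."""
        if self.is_up:
            self.discount(now, alpha)
            # A run that has just ended has t = 0, so it enters with weight 1.
            self.weighted_run_length += now - self.run_start
            self.is_up = False
            self.run_start = None
```

        Discounting lazily, only when a run ends, is what makes the "last
        weighted down" timestamp necessary: it records how much aging the
        stored sum has already received.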
  86
  87    Authorities should probe at random intervals to test whether routers are
  88    running.