notes/schema-tradeoffs.txt

   1
   2 The Tradeoff
   3 ============
   4
   5 CVS indexes data in a certain way.  When you create a CVS tag, a label
   6 must be applied to every single file in the repository.  It takes O(N)
   7 time, where N is the size of the tree you're tagging.  The tradeoff,
   8 however, is that someone can look at a specific version of a file, and
   9 see *all* the tag-labels attached to it.
  10
  11 Subversion's repository has the data indexed in the other direction.
  12 When you create an SVN tag, it makes a single new directory node that
  13 points to an existing tree.  It takes O(1) (constant) time.  The
  14 tradeoff, however, is that a version of a file is 'shared' by any
  15 number of directory paths.  That means it takes O(N) time to find
  16 every tag that contains the specific file.
  17
  18
  19 Why?
  20 ===
  21
  22 Why does Subversion index data this way?  There are a few reasons the
  23 designers chose to do this.  Having branches and tags in normal
  24 directory space makes it easy to browse them, easy to do access
  25 control on them, and (of course) they're automatically versioned.
  26
  27 Also, the designers thought this would be optimizing the right
  28 operations.  Organizations tend to create branches and tags quite
  29 frequently -- much more frequently than asking the question "which
  30 tags contain a specific file?"  So if you can only make one of these
  31 operations O(1), you want it to be tagging.  (Of course, if we knew a
  32 way to make them *both* cheap, that would be the best solution!  But
  33 we haven't found a way yet.)
  34
  35
  36
  37 Questions that Users Ask
  38 ========================
  39
  40 Here are some questions subversion users might ask, and how subversion
  41 deals with each question.
  42
  43
  44 1. "What version of foo.c is in tag X?"
  45
  46 This is the easiest question to answer.  Go into the tag-tree, and
  47 look at the version of foo.c it contains.
  48
  49 (This can be done with a simple "svn ls -v URL", where URL is a path
  50 to the specific tag directory.  Look at the first column of numbers.)
  51
  52
  53
  54 2.  "Does tag X contain the latest version of foo.c?"
  55
  56 This is a bit harder to answer.  From question 1, it's easy to see
  57 that tag X contains version N of foo.c.  But how do we know if that's
  58 the *latest* foo.c?
  59
  60 Running 'svn log' on version N of foo.c won't help, because it only
  61 goes backwards in time.  That is, it only shows predecessor nodes, not
  62 successor nodes.
  63
  64 Subversion-1.0 uses BerkeleyDB.  The only reason 'svn log' shows
  65 predecessor nodes easily is because each node contains a back-pointer
  66 to its predecessor.  It would be extremely painful to search BDB for
  67 successors;  BDB is mostly a glorified hashtable with transactions.
  68
  69    [Note from kfogel: Is there some reason we can't store successor
  70    nodes at commit time?  That is if N is a director successor of M,
  71    then when we create N, we add it to M's "successors list".  Then we
  72    could track forward as well as backward... Nothing against having
  73    an SQL backend someday, of course, just pointing out that this
  74    particular problem can be solved simply in Berkeley DB.]
  75
  76 Post-1.0 Subversion, however, will be able to use a SQL backend, and
  77 then it will be very quick and easy to query for node successors.  At
  78 that point, Subversion could make nice "complete history" graphs of
  79 nodes, just like Clearcase does.
  80
  81
  82
  83 3.  "Which tags contain version N of foo.c?"
  84
  85 This is the killer question, and the crux of the Tradeoff mentioned at
  86 the beginning of this document.  Because Subversion has O(1) tagging,
  87 the only way to answer this question is by brute-force searching.
  88
  89 But there are two consolations to this tradeoff:
  90
  91    A) "Rethink your work habits"
  92
  93       From experience, when users ask question #3, it can very often
  94       be rephrased as a question about a *specific* tag.  Very often,
  95       the manager doesn't really want to see the exhaustive list of
  96       every tag containing the file; instead, they simply want to know
  97       if a *certain* tag has the file.  ("Did we give that file to a
  98       particular customer?")  It turns into a "type-1" question.
  99
 100       If you're used to CVS, it's very easy to instantly get the list
 101       of all tags attached to a file-version.  And therefore you
 102       habituate to that, and use the tags-list as your main means of
 103       answering all your type-1 questions.  But it's certainly not
 104       *required* to answer type-1 questions.
 105
 106    B) "Build a cache"
 107
 108       If a brute-force search is ever performed, it shouldn't be too
 109       difficult to cache the results of the search, because repository
 110       trees are immutable.  That means the next time somebody runs the
 111       search, the search becomes *much* smaller.  Eventually, the
 112       search can dwindle down into what feels like O(1) time, at least
 113       when viewed from a distance.  :-)
 114