design-notes.txt

   1                        How cvs2svn.py Works
   2                       =====================
   3
   4 A cvs2svn run consists of five passes.  The first three passes save
   5 their data to files on disk, so that a) we don't hold huge amounts of
   6 state in memory, and b) the conversion process is resumable.
   7
   8 Pass 1:
   9 =======
  10
  11 The goal of this pass is to get a summary of all the revisions for
  12 each file written out to 'cvs2svn-data.revs'; at the end of this
  13 stage, revisions will be grouped by RCS file, not by logical commits.
  14
  15 We walk over the repository, processing each RCS file with
  16 rcsparse.parse(), using cvs2svn's CollectData class, which is a
  17 subclass of rcsparse.Sink(), the parser's callback class.  For each
  18 RCS file, the first thing the parser encounters is the administrative
  19 header, including the head revision, the principal branch, symbolic
  20 names, RCS comments, etc.  The main thing that happens here is that
  21 CollectData.define_tag() is invoked on each symbolic name and its
  22 attached revision, so all the tags and branches of this file get
  23 collected.
  24
  25 Next, the parser hits the revision summary section.  That's the part
  26 of the RCS file that looks like this:
  27
  28    1.6
  29    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
  30    branches
  31         1.6.2.1;
  32    next 1.5;
  33
  34    1.5
  35    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
  36    branches;
  37    next 1.4;
  38
  39    [...]
  40
  41 For each revision summary, CollectData.define_revision() is invoked,
  42 recording that revision's metadata in the self.rev_data[] tree.
  43
  44 After finishing the revision summaries, the parser invokes
  45 CollectData.tree_completed(), which loops over the revisions in
  46 self.rev_data, determining if there are instances where a higher
  47 revision was committed "before" a lower one (rare, but it can happen
  48 when there was clock skew on the repository machine).  If there are
  49 any, it "resyncs" the timestamp of the higher rev to be just after
  50 that of the lower rev, but saves the original timestamp in
  51 self.rev_data[blah][3], so we can later write out a record to the
  52 resync file indicating that an adjustment was made (this makes it
  53 possible to catch the other parts of this commit and resync them
  54 similarly, more details below).
  55
  56 Next, the parser encounters the *real* revision data, which has the
  57 log messages and file contents.  For each revision, it invokes
  58 CollectData.set_revision_info(), which writes a new line to
  59 cvs2svn-data.revs, like this:
  60
  61    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.1 1.2 1.3 1 1 1024 N * 0 0 foo/bar,v
  62
  63 The fields are:
  64
  65    1.  a fixed-width timestamp
  66    2.  a digest of the log message + author
  67    3.  the type of change ("A"dd, "C"hange, or "D"elete)
  68    4.  the revision number of the previous revision along this line of
  69        development (or "*", if it's the first revision on this line of
  70        development).
  71    5.  the revision number
  72    6.  the revision number of the next revision along this line of
  73        development (or "*", if it's the last revision on this line of
  74        development).
  75    7.  1 if the RCS file is in the Attic, "*" if it isn't.
  76    8.  1 is the RCS file has the executable bit set, "*" if not.
  77    9.  The size of the RCS file, in bytes.
  78    10.  "N" if this revision has non-empty deltatext, else "E" for empty
  79    11.  the RCS keyword substitution mode ("kkv", "kb", etc), or "*" if none
  80    12.  the branch on which this commit happened, or "*" if not on a branch
  81    13.  the number of tags rooted at this revision (followed by their
  82        names, space-delimited)
  83    14. the number of branches rooted at this revision (followed by
  84        their names, space-delimited)
  85    15. the path of the RCS file in the repository
  86
  87 (Of course, in the above example, fields 9 and 10 are "0", so they have
  88 no additional data.)
  89
  90 Also, for resync'd revisions, a line like this is written out to
  91 'cvs2svn-data.resync':
  92
  93    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
  94
  95 The fields are:
  96
  97    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP
  98
  99 (The resync file will be explained later.)
 100
 101 That's it -- the RCS file is done.
 102
 103 When every RCS file is done, Pass 1 is complete, and:
 104
 105    - cvs2svn-data.revs contains a summary of every RCS file's
 106      revisions.  All the revisions for a given RCS file are grouped
 107      together, but note that the groups are in no particular order.
 108      In other words, you can't yet identify the commits from looking
 109      at these lines; a multi-file commit will be scattered all over
 110      the place.
 111
 112    - cvs2svn-data.resync contains a small amount of resync data, in
 113      no particular order.
 114
 115 Pass 2:
 116 =======
 117
 118 This is where the resync file is used.  The goal of this pass is to
 119 convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
 120 revs).  It's the same as the original file, except for some resync'd
 121 timestamps.
 122
 123 First, read the whole resync file into a hash table that maps each
 124 author+log digest to a list of lists.  Each sublist represents one of
 125 the timestamp adjustments from Pass 1, and looks like this:
 126
 127    [old_time_lower, old_time_upper, new_time]
 128
 129 The reason to map each digest to a list of sublists, instead of to one
 130 list, is that sometimes you'll get the same digest for unrelated
 131 commits (for example, the same author commits many times using the
 132 empty log message, or a log message that just says "Doc tweaks.").  So
 133 each digest may need to "fan out" to cover multiple commits, but
 134 without accidentally unifying those commits.
 135
 136 Now we loop over cvs2svn-data.revs, writing each line out to
 137 'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
 138 those whose digest matches some resync entry, and appear to be part of
 139 the same commit as one of the sublists in that entry, get tweaked.
 140 The tweak is to adjust the commit time of the line to the new_time,
 141 which is taken from the resync hash and results from the adjustment
 142 described in Pass 1.
 143
 144 The way we figure out whether a given line needs to be tweaked is to
 145 loop over all the sublists, seeing if this commit's original time
 146 falls within the old<-->new time range for the current sublist.  If it
 147 does, we tweak the line before writing it out, and then conditionally
 148 adjust the sublist's range to account for the timestamp we just
 149 adjusted (since it could be an outlier).  Note that this could, in
 150 theory, result in separate commits being accidentally unified, since
 151 we might gradually adjust the two sides of the range such that they are
 152 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 153 really a case of CVS not recording enough information to disambiguate
 154 the commits; we'd know we have a time range that exceeds the
 155 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 156 up.  We could try some clever heuristic, but for now it's not
 157 important -- after all, we're talking about commits that weren't
 158 important enough to have a distinctive log message anyway, so does it
 159 really matter if a couple of them accidentally get unified?  Probably
 160 not.
 161
 162 Pass 3:
 163 =======
 164
 165 This is where we deduce the changesets, that is, the grouping of file
 166 changes into single commits.
 167
 168 It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
 169 to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
 170 this causes commits with the same digest (that is, the same author and
 171 log message) to be grouped together.  Poof!  We now have the CVS
 172 changes grouped by logical commit.
 173
 174 In some cases, the changes in a given commit may be interleaved with
 175 other commits that went on at the same time, because the sort gives
 176 precedence to date before log digest.  However, Pass 4 detects this by
 177 seeing that the log digest is different, and reseparates the commits.
 178
 179 Pass 4:
 180 =======
 181
 182 In --dump-only mode, the result of this pass is a Subversion
 183 repository dumpfile (suitable for input to 'svnadmin load').  The
 184 dumpfile is the data's last static stage: last chance to check over
 185 the data, run it through svndumpfilter, move the dumpfile to another
 186 machine, etc.
 187
 188 However, when not in --dump-only mode, no full dumpfile is created for
 189 subsequent load into a Subversion repository.  Instead, miniature
 190 dumpfiles represent a single revision are created, loaded into the
 191 repository, and then removed.
 192
 193 In both modes, the dumpfile revisions are created by walking through
 194 cvs2svn-data.s-revs.
 195
 196                   ===============================
 197                       Branches and Tags Plan.
 198                   ===============================
 199
 200 This pass is also where tag and branch creation is done.  Since
 201 subversion does tags and branches by copying from existing revisions
 202 (then maybe editing the copy, making subcopies underneath, etc), the
 203 big question for cvs2svn is how to achieve the minimum number of
 204 operations per creation.  For example, if it's possible to get the
 205 right tag by just copying revision 53, then it's better to do that
 206 than, say, copying revision 51 and then sub-copying in bits of
 207 revision 52 and 53.
 208
 209 Also, since CVS does not version symbolic names, there is the
 210 secondary question of *when* to create a particular tag or branch.
 211 For example, a tag might have been made at any time after the youngest
 212 commit included in it, or might even have been made piecemeal; and the
 213 same is true for a branch, with the added constraint that for any
 214 particular file, the branch must have been created before the first
 215 commit on the branch.
 216
 217 Answering the second question first: cvs2svn creates tags and branches
 218 as late as possible.  For branches, this is "just in time" creation --
 219 the moment it sees the first commit on a branch, it snaps the entire
 220 branch into existence (or as much of it as possible), and then outputs
 221 the branch commit.
 222
 223 The reason we say "as much of it as possible" is that it's possible to
 224 have a branch where some files have branch commits occuring earlier
 225 than the other files even have the source revisions from which the
 226 branch sprouts (this can happen if the branch was created piecemeal,
 227 for example).  In this case, we create as much of the branch as we
 228 can, that is, as much of it as there are source revisions available to
 229 copy, and leave the rest for later.  "Later" might mean just until
 230 other branch commits come in, or else during a cleanup stage that
 231 happens at the end of this pass (about which more later).
 232
 233 All tags are created during the cleanup stage, after all regular
 234 commits have been made.  That way there's no need to worry whether all
 235 the required revisions for a particular tag have been committed yet,
 236 and it's as correct as any other time, since no one can tell when a
 237 tag was made anyway.
 238
 239 How just-in-time branch creation works:
 240
 241 In order to make the "best" set of copies/deletes when creating a
 242 branch, cvs2svn keeps track of two sets of trees while it's making
 243 commits:
 244
 245    1. A skeleton mirror of the subversion repository, that is, an
 246       array of revisions, with a tree hanging off each revision.  (The
 247       "array" is actually implemented as an anydbm database itself,
 248       mapping string representations of numbers to root keys.)
 249
 250    2. A tree for each CVS symbolic name, and the svn file/directory
 251       revisions from which various parts of that tree could be copied.
 252
 253 Both tree sets live in anydbm databases, using the same basic schema:
 254 unique keys map to marshal.dumps() representations of dictionaries,
 255 which in turn map entry names to other unique keys:
 256
 257    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 258    entrykey1 ==> { entrynameX : entrykeyX, ... }
 259    entrykey2 ==> { entrynameY : entrykeyY, ... }
 260    entrykeyX ==> { etc, etc ...}
 261    entrykeyY ==> { etc, etc ...}
 262
 263 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
 264
 265 Both file and directory dictionaries store metadata under special keys
 266 whose names start with "/", so they can always be distinguished from
 267 entries (for example, search for "/mutable", "/openings", or
 268 "/closings" in cvs2svn.py).
 269
 270 The repository mirror allows cvs2svn to remember what paths exist in
 271 what revisions.  For each file path in a revision, it records what
 272 tags and branches can sprout from that revision; when the file
 273 changes, these attributes do not propagate to the new revision, since
 274 the symbolic name isn't based on that revision.
 275
 276 The symbolic name trees are all stored in one db file, as paths, where
 277 the first element in each path is the symbolic name, and the rest is
 278 the full Subversion path to the file in question.  For example, if the
 279 Subversion revision 7 is the root of branch 'Rel_1', this fact would
 280 be recorded under the path
 281
 282    '/Rel_1/myproj/trunk/lib/driver.c'
 283
 284 (the exact layout is dependent on the make_path() function in
 285 cvs2svn.py, which may change).
 286
 287    root_key  ==> { 'Rel_1' : 'a', ... }
 288    'a'       ==> { 'myproj' : 'b', ... }
 289    'b'       ==> { 'trunk : 'c', ... }
 290    'c'       ==> { 'lib' : 'd', ... }
 291    'd'       ==> { 'driver.c' : 'e', ... }
 292    'e'       ==> { }
 293
 294 The source revision is stored in the leaf node, and also in all the
 295 parent nodes, in the manner described in the class documentation for
 296 'SymbolicNameTracker'.  The special entries "/opening" and "/closing"
 297 are not shown above, for brevity, but their values are where the
 298 revision ranges are stored (that is, the ranges indicating when this
 299 path could be copied from to produce the tag or branch in question).
 300
 301 When it's time to create a branch or tag, cvs2svn.py walks the
 302 appropriate symbolic name tree, calculating the ideal source revision
 303 for each subpath (see 'SymbolicNameTracker' for the exact algorithm)
 304 and emitting the minimum number of copies to the dumpfile and to the
 305 skeleton repository mirror.  As it goes, it marks each path as
 306 emitted, so that we don't redo the same copies during the cleanup
 307 phase later on.
 308
 309 At this point, the entire branch is done except for:
 310
 311    1. Any source revisions that haven't yet been committed (this is
 312       a rare situation, but anyway such revisions will automatically
 313       be handled later by the same algorithm, invoked either due to
 314       another commit on the branch, or in the cleanup phase), and
 315
 316    2. Files that were accidentally copied onto the branch as part of a
 317       subtree, but which don't actually belong on the branch, because
 318       the corresponding CVS file doesn't contain that tag.
 319
 320 We handle (2) by doing tree diffs between the newly copied tree in the
 321 skeleton repository mirror, and the corresponding portion of the
 322 symbolic name tree.  If the skeleton mirror has a file that's not in
 323 the symbolic name tree, we emit a delete to the dumpfile and remove
 324 that path from the skeleton mirror.
 325
 326 The cleanup phase happens after all regular changes have been
 327 processed.  Just loop over the "root directory" of the symbolic name
 328 tree, running the same creation algorithm on each name (we'll have to
 329 distinguish between branches and tags, probably through a special
 330 entry on the directory object), skipping parts of the tree already
 331 marked as copied.
 332
 333
 334 Pass 5:
 335 =======
 336
 337 Unless we're skipping cleanup, remove all our intermediate files.
 338
 339
 340
 341
 342 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 343
 344 Some older notes and ideas about cvs2svn.  Not deleted, because they
 345 may contain suggestions for future improvements in design.
 346
 347 -----------------------------------------------------------------------
 348
 349 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 350 considerations for the tool.
 351
 352 ------
 353 From: John Gardiner Myers <jgmyers@speakeasy.net>
 354 Subject: Thoughts on CVS to SVN conversion
 355 To: gstein@lyra.org
 356 Date: Sun, 15 Apr 2001 17:47:10 -0700
 357
 358 Some things you may want to consider for a CVS to SVN conversion utility:
 359
 360 If converting a CVS repository to SVN takes days, it would be good for
 361 the conversion utility to keep its progress state on disk.  If the
 362 conversion fails halfway through due to a network outage or power
 363 failure, that would allow the conversion to be resumed where it left off
 364 instead of having to start over from an empty SVN repository.
 365
 366 It is a short step from there to allowing periodic updates of a
 367 read-only SVN repository from a read/write CVS repository.  This allows
 368 the more relaxed conversion procedure:
 369
 370 1) Create SVN repository writable only by the conversion tool.
 371 2) Update SVN repository from CVS repository.
 372 3) Announce the time of CVS to SVN cutover.
 373 4) Repeat step (2) as needed.
 374 5) Disable commits to CVS repository, making it read-only.
 375 6) Repeat step (2).
 376 7) Enable commits to SVN repository.
 377 8) Wait for developers to move their workspaces to SVN.
 378 9) Decomission the CVS repository.
 379
 380 You may forward this message or parts of it as you seem fit.
 381 ------
 382
 383 -----------------------------------------------------------------------
 384
 385 Further design thoughts from Greg Stein <gstein@lyra.org>
 386
 387 * timestamp the beginning of the process. ignore any commits that
 388   occur after that timestamp; otherwise, you could miss portions of a
 389   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 390   revision for items in B; we missed A)
 391
 392 * the above timestamp can also be used for John's "grab any updates
 393   that were missed in the previous pass."
 394
 395 * for each file processed, watch out for simultaneous commits. this
 396   may cause a problem during the reading/scanning/parsing of the file,
 397   or the parse succeeds but the results are garbaged. this could be
 398   fixed with a CVS lock, but I'd prefer read-only access.
 399
 400   algorithm: get the mtime before opening the file. if an error occurs
 401   during reading, and the mtime has changed, then restart the file. if
 402   the read is successful, but the mtime changed, then restart the
 403   file.
 404
 405 * use a separate log to track unique branches and non-branched forks
 406   of revision history (Q: is it possible to create, say, 1.4.1.3
 407   without a "real" branch?). this log can then be used to create a
 408   /branches/ directory in the SVN repository.
 409
 410   Note: we want to determine some way to coalesce branches across
 411   files. It can't be based on name, though, since the same branch name
 412   could be used in multiple places, yet they are semantically
 413   different branches. Given files R, S, and T with branch B, we can
 414   tie those files' branch B into a "semantic group" whenever we see
 415   commit groups on a branch touching multiple files. Files that are
 416   have a (named) branch but no commits on it are simply ignored. For
 417   each "semantic group" of a branch, we'd create a branch based on
 418   their common ancestor, then make the changes on the children as
 419   necessary. For single-file commits to a branch, we could use
 420   heuristics (pathname analysis) to add these to a group (and log what
 421   we did), or we could put them in a "reject" kind of file for a human
 422   to tell us what to do (the human would edit a config file of some
 423   kind to instruct the converter).
 424
 425 * if we have access to the CVSROOT/history, then we could process tags
 426   properly. otherwise, we can only use heuristics or configuration
 427   info to group up tags (branches can use commits; there are no
 428   commits associated with tags)
 429
 430 * ideally, we store every bit of data from the ,v files to enable a
 431   complete restoration of the CVS repository. this could be done by
 432   storing properties with CVS revision numbers and stuff (i.e. all
 433   metadata not already embodied by SVN would go into properties)
 434
 435 * how do we track the "states"? I presume "dead" is simply deleting
 436   the entry from SVN. what are the other legal states, and do we need
 437   to do anything with them?
 438
 439 * where do we put the "description"? how about locks, access list,
 440   keyword flags, etc.
 441
 442 * note that using something like the SourceForge repository will be an
 443   ideal test case. people *move* their repositories there, which means
 444   that all kinds of stuff can be found in those repositories, from
 445   wherever people used to run them, and under whatever development
 446   policies may have been used.
 447
 448   For example: I found one of the projects with a "permissions 644;"
 449   line in the "gnuplot" repository. Most RCS releases issue warnings
 450   about that (although they properly handle/skip the lines).