design-notes.txt

   1                        How cvs2svn.py Works
   2                       =====================
   3
   4 A cvs2svn run consists of five passes.  The first three passes save
   5 their data to files on disk, so that a) we don't hold huge amounts of
   6 state in memory, and b) the conversion process is resumable.
   7
   8 Pass 1:
   9 =======
  10
  11 The goal of this pass is to get a summary of all the revisions for
  12 each file written out to 'cvs2svn-data.revs'; at the end of this
  13 stage, revisions will be grouped by RCS file, not by logical commits.
  14
  15 We walk over the repository, processing each RCS file with
  16 rcsparse.parse(), using cvs2svn's CollectData class, which is a
  17 subclass of rcsparse.Sink(), the parser's callback class.  For each
  18 RCS file, the first thing the parser encounters is the administrative
  19 header, including the head revision, the principal branch, symbolic
  20 names, RCS comments, etc.  The main thing that happens here is that
  21 CollectData.define_tag() is invoked on each symbolic name and its
  22 attached revision, so all the tags and branches of this file get
  23 collected.
  24
  25 Next, the parser hits the revision summary section.  That's the part
  26 of the RCS file that looks like this:
  27
  28    1.6
  29    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
  30    branches
  31         1.6.2.1;
  32    next 1.5;
  33
  34    1.5
  35    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
  36    branches;
  37    next 1.4;
  38
  39    [...]
  40
  41 For each revision summary, CollectData.define_revision() is invoked,
  42 recording that revision's metadata in the self.rev_data[] tree.
  43
  44 After finishing the revision summaries, the parser invokes
  45 CollectData.tree_completed(), which loops over the revisions in
  46 self.rev_data, determining if there are instances where a higher
  47 revision was committed "before" a lower one (rare, but it can happen
  48 when there was clock skew on the repository machine).  If there are
  49 any, it "resyncs" the timestamp of the higher rev to be just after
  50 that of the lower rev, but saves the original timestamp in
  51 self.rev_data[blah][3], so we can later write out a record to the
  52 resync file indicating that an adjustment was made (this makes it
  53 possible to catch the other parts of this commit and resync them
  54 similarly, more details below).
  55
  56 Next, the parser encounters the *real* revision data, which has the
  57 log messages and file contents.  For each revision, it invokes
  58 CollectData.set_revision_info(), which writes a new line to
  59 cvs2svn-data.revs, like this:
  60
  61    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.2 N * 0 0 foo/bar,v
  62
  63 The fields are:
  64
  65    1. a fixed-width timestamp
  66    2. a digest of the log message + author
  67    3. the type of change ("C"hange, or "D"elete)
  68    4. the revision number
  69    5. "N" if this revision has non-empty deltatext, else "E" for empty
  70    6. the branch on which this commit happened, or "*" if not on a branch
  71    7. the number of tags rooted at this revision (followed by their
  72         names, space-delimited)
  73    8. the number of branches rooted at this revision (followed by
  74         their names, space-delimited)
  75    9. the path of the RCS file in the repository
  76
  77 (Of course, in the above example, fields 6 and 7 are "0", so they have
  78 no additional data.)
  79
  80 Also, for resync'd revisions, a line like this is written out to
  81 'cvs2svn-data.resync':
  82
  83    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
  84
  85 The fields are:
  86
  87    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP
  88
  89 (The resync file will be explained later.)
  90
  91 That's it -- the RCS file is done.
  92
  93 When every RCS file is done, Pass 1 is complete, and:
  94
  95    - cvs2svn-data.revs contains a summary of every RCS file's
  96      revisions.  All the revisions for a given RCS file are grouped
  97      together, but note that the groups are in no particular order.
  98      In other words, you can't yet identify the commits from looking
  99      at these lines; a multi-file commit will be scattered all over
 100      the place.
 101
 102    - cvs2svn-data.resync contains a small amount of resync data, in
 103      no particular order.
 104
 105 Pass 2:
 106 =======
 107
 108 This is where the resync file is used.  The goal of this pass is to
 109 convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
 110 revs).  It's the same as the original file, except for some resync'd
 111 timestamps.
 112
 113 First, read the whole resync file into a hash table that maps each
 114 author+log digest to a list of lists.  Each sublist represents one of
 115 the timestamp adjustments from Pass 1, and looks like this:
 116
 117    [old_time_lower, old_time_upper, new_time]
 118
 119 The reason to map each digest to a list of sublists, instead of to one
 120 list, is that sometimes you'll get the same digest for unrelated
 121 commits (for example, the same author commits many times using the
 122 empty log message, or a log message that just says "Doc tweaks.").  So
 123 each digest may need to "fan out" to cover multiple commits, but
 124 without accidentally unifying those commits.
 125
 126 Now we loop over cvs2svn-data.revs, writing each line out to
 127 'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
 128 those whose digest matches some resync entry, and appear to be part of
 129 the same commit as one of the sublists in that entry, get tweaked.
 130 The tweak is to adjust the commit time of the line to the new_time,
 131 which is taken from the resync hash and results from the adjustment
 132 described in Pass 1.
 133
 134 The way we figure out whether a given line needs to be tweaked is to
 135 loop over all the sublists, seeing if this commit's original time
 136 falls within the old<-->new time range for the current sublist.  If it
 137 does, we tweak the line before writing it out, and then conditionally
 138 adjust the sublist's range to account for the timestamp we just
 139 adjusted (since it could be an outlier).  Note that this could, in
 140 theory, result in separate commits being accidentally unified, since
 141 we might gradually adjust the two sides of the range such that they are
 142 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 143 really a case of CVS not recording enough information to disambiguate
 144 the commits; we'd know we have a time range that exceeds the
 145 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 146 up.  We could try some clever heuristic, but for now it's not
 147 important -- after all, we're talking about commits that weren't
 148 important enough to have a distinctive log message anyway, so does it
 149 really matter if a couple of them accidentally get unified?  Probably
 150 not.
 151
 152 Pass 3:
 153 =======
 154
 155 This is where we deduce the changesets, that is, the grouping of file
 156 changes into single commits.
 157
 158 It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
 159 to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
 160 this causes commits with the same digest (that is, the same author and
 161 log message) to be grouped together.  Poof!  We now have the CVS
 162 changes grouped by logical commit.
 163
 164 In some cases, the changes in a given commit may be interleaved with
 165 other commits that went on at the same time, because the sort gives
 166 precedence to date before log digest.  However, Pass 4 detects this by
 167 seeing that the log digest is different, and reseparates the commits.
 168
 169 Pass 4:
 170 =======
 171
 172 In --dump-only mode, the result of this pass is a Subversion
 173 repository dumpfile (suitable for input to 'svnadmin load').  The
 174 dumpfile is the data's last static stage: last chance to check over
 175 the data, run it through svndumpfilter, move the dumpfile to another
 176 machine, etc.
 177
 178 However, when not in --dump-only mode, no full dumpfile is created for
 179 subsequent load into a Subversion repository.  Instead, miniature
 180 dumpfiles represent a single revision are created, loaded into the
 181 repository, and then removed.
 182
 183 In both modes, the dumpfile revisions are created by walking through
 184 cvs2svn-data.s-revs.
 185
 186                   ===============================
 187                       Branches and Tags Plan.
 188                   ===============================
 189
 190 This pass is also where tag and branch creation is done.  Since
 191 subversion does tags and branches by copying from existing revisions
 192 (then maybe editing the copy, making subcopies underneath, etc), the
 193 big question for cvs2svn is how to achieve the minimum number of
 194 operations per creation.  For example, if it's possible to get the
 195 right tag by just copying revision 53, then it's better to do that
 196 than, say, copying revision 51 and then sub-copying in bits of
 197 revision 52 and 53.
 198
 199 Also, since CVS does not version symbolic names, there is the
 200 secondary question of *when* to create a particular tag or branch.
 201 For example, a tag might have been made at any time after the youngest
 202 commit included in it, or might even have been made piecemeal; and the
 203 same is true for a branch, with the added constraint that for any
 204 particular file, the branch must have been created before the first
 205 commit on the branch.
 206
 207 Answering the second question first: cvs2svn creates tags and branches
 208 as late as possible.  For branches, this is "just in time" creation --
 209 the moment it sees the first commit on a branch, it snaps the entire
 210 branch into existence (or as much of it as possible), and then outputs
 211 the branch commit.
 212
 213 The reason we say "as much of it as possible" is that it's possible to
 214 have a branch where some files have branch commits occuring earlier
 215 than the other files even have the source revisions from which the
 216 branch sprouts (this can happen if the branch was created piecemeal,
 217 for example).  In this case, we create as much of the branch as we
 218 can, that is, as much of it as there are source revisions available to
 219 copy, and leave the rest for later.  "Later" might mean just until
 220 other branch commits come in, or else during a cleanup stage that
 221 happens at the end of this pass (about which more later).
 222
 223 All tags are created during the cleanup stage, after all regular
 224 commits have been made.  That way there's no need to worry whether all
 225 the required revisions for a particular tag have been committed yet,
 226 and it's as correct as any other time, since no one can tell when a
 227 tag was made anyway.
 228
 229 How just-in-time branch creation works:
 230
 231 In order to make the "best" set of copies/deletes when creating a
 232 branch, cvs2svn keeps track of two sets of trees while it's making
 233 commits:
 234
 235    1. A skeleton mirror of the subversion repository, that is, an
 236       array of revisions, with a tree hanging off each revision.  (The
 237       "array" is actually implemented as an anydbm database itself,
 238       mapping string representations of numbers to root keys.)
 239
 240    2. A tree for each CVS symbolic name, and the svn file/directory
 241       revisions from which various parts of that tree could be copied.
 242
 243 Both tree sets live in anydbm databases, using the same basic schema:
 244 unique keys map to marshal.dumps() representations of dictionaries,
 245 which in turn map entry names to other unique keys:
 246
 247    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 248    entrykey1 ==> { entrynameX : entrykeyX, ... }
 249    entrykey2 ==> { entrynameY : entrykeyY, ... }
 250    entrykeyX ==> { etc, etc ...}
 251    entrykeyY ==> { etc, etc ...}
 252
 253 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
 254
 255 Both file and directory dictionaries store metadata under special keys
 256 whose names start with "/", so they can always be distinguished from
 257 entries (for example, search for "/mutable", "/openings", or
 258 "/closings" in cvs2svn.py).
 259
 260 The repository mirror allows cvs2svn to remember what paths exist in
 261 what revisions.  For each file path in a revision, it records what
 262 tags and branches can sprout from that revision; when the file
 263 changes, these attributes do not propagate to the new revision, since
 264 the symbolic name isn't based on that revision.
 265
 266 The symbolic name trees are all stored in one db file, as paths, where
 267 the first element in each path is the symbolic name, and the rest is
 268 the full Subversion path to the file in question.  For example, if the
 269 Subversion revision 7 is the root of branch 'Rel_1', this fact would
 270 be recorded under the path
 271
 272    '/Rel_1/myproj/trunk/lib/driver.c'
 273
 274 (the exact layout is dependent on the make_path() function in
 275 cvs2svn.py, which may change).
 276
 277    root_key  ==> { 'Rel_1' : 'a', ... }
 278    'a'       ==> { 'myproj' : 'b', ... }
 279    'b'       ==> { 'trunk : 'c', ... }
 280    'c'       ==> { 'lib' : 'd', ... }
 281    'd'       ==> { 'driver.c' : 'e', ... }
 282    'e'       ==> { }
 283
 284 The source revision is stored in the leaf node, and also in all the
 285 parent nodes, in the manner described in the class documentation for
 286 'SymbolicNameTracker'.  The special entries "/opening" and "/closing"
 287 are not shown above, for brevity, but their values are where the
 288 revision ranges are stored (that is, the ranges indicating when this
 289 path could be copied from to produce the tag or branch in question).
 290
 291 When it's time to create a branch or tag, cvs2svn.py walks the
 292 appropriate symbolic name tree, calculating the ideal source revision
 293 for each subpath (see 'SymbolicNameTracker' for the exact algorithm)
 294 and emitting the minimum number of copies to the dumpfile and to the
 295 skeleton repository mirror.  As it goes, it marks each path as
 296 emitted, so that we don't redo the same copies during the cleanup
 297 phase later on.
 298
 299 At this point, the entire branch is done except for:
 300
 301    1. Any source revisions that haven't yet been committed (this is
 302       a rare situation, but anyway such revisions will automatically
 303       be handled later by the same algorithm, invoked either due to
 304       another commit on the branch, or in the cleanup phase), and
 305
 306    2. Files that were accidentally copied onto the branch as part of a
 307       subtree, but which don't actually belong on the branch, because
 308       the corresponding CVS file doesn't contain that tag.
 309
 310 We handle (2) by doing tree diffs between the newly copied tree in the
 311 skeleton repository mirror, and the corresponding portion of the
 312 symbolic name tree.  If the skeleton mirror has a file that's not in
 313 the symbolic name tree, we emit a delete to the dumpfile and remove
 314 that path from the skeleton mirror.
 315
 316 The cleanup phase happens after all regular changes have been
 317 processed.  Just loop over the "root directory" of the symbolic name
 318 tree, running the same creation algorithm on each name (we'll have to
 319 distinguish between branches and tags, probably through a special
 320 entry on the directory object), skipping parts of the tree already
 321 marked as copied.
 322
 323
 324 Pass 5:
 325 =======
 326
 327 Unless we're skipping cleanup, remove all our intermediate files.
 328
 329
 330
 331
 332 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 333
 334 Some older notes and ideas about cvs2svn.  Not deleted, because they
 335 may contain suggestions for future improvements in design.
 336
 337 -----------------------------------------------------------------------
 338
 339 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 340 considerations for the tool.
 341
 342 ------
 343 From: John Gardiner Myers <jgmyers@speakeasy.net>
 344 Subject: Thoughts on CVS to SVN conversion
 345 To: gstein@lyra.org
 346 Date: Sun, 15 Apr 2001 17:47:10 -0700
 347
 348 Some things you may want to consider for a CVS to SVN conversion utility:
 349
 350 If converting a CVS repository to SVN takes days, it would be good for
 351 the conversion utility to keep its progress state on disk.  If the
 352 conversion fails halfway through due to a network outage or power
 353 failure, that would allow the conversion to be resumed where it left off
 354 instead of having to start over from an empty SVN repository.
 355
 356 It is a short step from there to allowing periodic updates of a
 357 read-only SVN repository from a read/write CVS repository.  This allows
 358 the more relaxed conversion procedure:
 359
 360 1) Create SVN repository writable only by the conversion tool.
 361 2) Update SVN repository from CVS repository.
 362 3) Announce the time of CVS to SVN cutover.
 363 4) Repeat step (2) as needed.
 364 5) Disable commits to CVS repository, making it read-only.
 365 6) Repeat step (2).
 366 7) Enable commits to SVN repository.
 367 8) Wait for developers to move their workspaces to SVN.
 368 9) Decomission the CVS repository.
 369
 370 You may forward this message or parts of it as you seem fit.
 371 ------
 372
 373 -----------------------------------------------------------------------
 374
 375 Further design thoughts from Greg Stein <gstein@lyra.org>
 376
 377 * timestamp the beginning of the process. ignore any commits that
 378   occur after that timestamp; otherwise, you could miss portions of a
 379   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 380   revision for items in B; we missed A)
 381
 382 * the above timestamp can also be used for John's "grab any updates
 383   that were missed in the previous pass."
 384
 385 * for each file processed, watch out for simultaneous commits. this
 386   may cause a problem during the reading/scanning/parsing of the file,
 387   or the parse succeeds but the results are garbaged. this could be
 388   fixed with a CVS lock, but I'd prefer read-only access.
 389
 390   algorithm: get the mtime before opening the file. if an error occurs
 391   during reading, and the mtime has changed, then restart the file. if
 392   the read is successful, but the mtime changed, then restart the
 393   file.
 394
 395 * use a separate log to track unique branches and non-branched forks
 396   of revision history (Q: is it possible to create, say, 1.4.1.3
 397   without a "real" branch?). this log can then be used to create a
 398   /branches/ directory in the SVN repository.
 399
 400   Note: we want to determine some way to coalesce branches across
 401   files. It can't be based on name, though, since the same branch name
 402   could be used in multiple places, yet they are semantically
 403   different branches. Given files R, S, and T with branch B, we can
 404   tie those files' branch B into a "semantic group" whenever we see
 405   commit groups on a branch touching multiple files. Files that are
 406   have a (named) branch but no commits on it are simply ignored. For
 407   each "semantic group" of a branch, we'd create a branch based on
 408   their common ancestor, then make the changes on the children as
 409   necessary. For single-file commits to a branch, we could use
 410   heuristics (pathname analysis) to add these to a group (and log what
 411   we did), or we could put them in a "reject" kind of file for a human
 412   to tell us what to do (the human would edit a config file of some
 413   kind to instruct the converter).
 414
 415 * if we have access to the CVSROOT/history, then we could process tags
 416   properly. otherwise, we can only use heuristics or configuration
 417   info to group up tags (branches can use commits; there are no
 418   commits associated with tags)
 419
 420 * ideally, we store every bit of data from the ,v files to enable a
 421   complete restoration of the CVS repository. this could be done by
 422   storing properties with CVS revision numbers and stuff (i.e. all
 423   metadata not already embodied by SVN would go into properties)
 424
 425 * how do we track the "states"? I presume "dead" is simply deleting
 426   the entry from SVN. what are the other legal states, and do we need
 427   to do anything with them?
 428
 429 * where do we put the "description"? how about locks, access list,
 430   keyword flags, etc.
 431
 432 * note that using something like the SourceForge repository will be an
 433   ideal test case. people *move* their repositories there, which means
 434   that all kinds of stuff can be found in those repositories, from
 435   wherever people used to run them, and under whatever development
 436   policies may have been used.
 437
 438   For example: I found one of the projects with a "permissions 644;"
 439   line in the "gnuplot" repository. Most RCS releases issue warnings
 440   about that (although they properly handle/skip the lines).