design-notes.txt

   1                          How cvs2svn Works
   2                          =================
   3
   4 A cvs2svn run consists of eight passes.  Each pass saves the data it
   5 produces to files on disk, so that a) we don't hold huge amounts of
   6 state in memory, and b) the conversion process is resumable.
   7
   8 CollectRevsPass (formerly called pass1)
   9 ===============
  10
  11 The goal of this pass is to write to 'cvs2svn-data.revs' a summary of
  12 all the revisions for each RCS file.  Each revision will be
  13 represented by one line.  At the end of this stage, the revisions
  14 (i.e., the lines) will be grouped by RCS file, not by logical commits.
  15
  16 We walk over the repository, collecting data about the RCS files into
  17 an instance of CollectData.  Each RCS file is processed with
  18 rcsparse.parse(), which invokes callbacks from an instance of
  19 cvs2svn's FileDataCollector class (which is a subclass of
  20 rcsparse.Sink).
  21
  22 For each RCS file, the first thing the parser encounters is the
  23 administrative header, including the head revision, the principal
  24 branch, symbolic names, RCS comments, etc.  The main thing that
  25 happens here is that FileDataCollector.define_tag() is invoked on each
  26 symbolic name and its attached revision, so all the tags and branches
  27 of this file get collected.
  28
  29 Next, the parser hits the revision summary section.  That's the part
  30 of the RCS file that looks like this:
  31
  32    1.6
  33    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
  34    branches
  35         1.6.2.1;
  36    next 1.5;
  37
  38    1.5
  39    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
  40    branches;
  41    next 1.4;
  42
  43    [...]
  44
  45 For each revision summary, FileDataCollector.define_revision() is
  46 invoked, recording that revision's metadata in various variables of
  47 the FileDataCollector class instance.
  48
  49 After finishing the revision summaries, the parser invokes
  50 FileDataCollector.tree_completed(), which loops over the revision
  51 information stored, determining if there are instances where a higher
  52 revision was committed "before" a lower one (rare, but it can happen
  53 when there was clock skew on the repository machine).  If there are
  54 any, it "resyncs" the timestamp of the earlier rev to be just before
  55 that of the later rev, but saves the original timestamp in
  56 self.rev_data[blah][2], so we can later write out a record to the
  57 resync file indicating that an adjustment was made (this makes it
  58 possible to catch the other parts of this commit and resync them
  59 similarly, more details below).
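
A minimal sketch of the resync step described above, not the actual
tree_completed() code.  It assumes the revisions arrive as
(revision, timestamp) pairs ordered from oldest to newest along one
line of development; resync_timestamps and its return shape are
hypothetical names chosen for illustration.

```python
def resync_timestamps(revisions):
    """Fix timestamps that run backwards along a line of development.

    revisions: list of (revision_number, timestamp) pairs, oldest
    revision first (an assumed input shape).
    Returns (adjusted, originals): the adjusted list, plus a dict
    mapping each resync'd revision to its original timestamp.
    """
    adjusted = list(revisions)
    originals = {}
    # Walk from the newest revision back to the oldest, so that one
    # adjustment can cascade to even earlier revisions if needed.
    for i in range(len(adjusted) - 1, 0, -1):
        later_rev, later_time = adjusted[i]
        earlier_rev, earlier_time = adjusted[i - 1]
        if earlier_time > later_time:
            # Remember the original time so a resync record can be
            # written out later.
            originals[earlier_rev] = earlier_time
            # Move the earlier revision to just before the later one.
            adjusted[i - 1] = (earlier_rev, later_time - 1)
    return adjusted, originals
```

Walking backwards is a design choice of this sketch: it guarantees
that after one pass every revision is strictly older than its
successor.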
  60
  61 Next, the parser encounters the *real* revision data, which has the
  62 log messages and file contents.  For each revision, it invokes
  63 FileDataCollector.set_revision_info(), which writes a new line to
  64 cvs2svn-data.revs.  The line is constructed by the CVSRevision class -
  65 one of its many roles. Here is an example:
  66
  67    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 3dc32954 3dc32956 C 1.1 1.2 1.3 T F 1024 N * 0 0 foo/bar,v
  68
  69 The fields are:
  70
  71    1.  a fixed-width timestamp
  72    2.  a digest of the log message + author
  73    3.  a fixed-width timestamp indicating the timestamp of this
  74        revision's previous revision (or "*", if it's the first
  75        revision on this line of development).
  76    4.  a fixed-width timestamp indicating the timestamp of this
  77        revision's next revision (or "*", if it's the last revision on
  78        this line of development).
  79    5.  the type of change ("A"dd, "C"hange, or "D"elete)
  80    6.  the revision number of the previous revision along this line of
  81        development (or "*", if it's the first revision on this line of
  82        development).
  83    7.  the revision number
  84    8.  the revision number of the next revision along this line of
  85        development (or "*", if it's the last revision on this line of
  86        development).
  87    9.  "T" if the RCS file is in the Attic, "F" if it isn't.
  88    10. "T" if the RCS file has the executable bit set, "F" if not.
  89    11. The size of the RCS file, in bytes.
  90    12. "T" if this revision has non-empty deltatext, else "F" for empty.
  91    13. the RCS keyword substitution mode ("k", "b", etc), or "*" if none
  92    14. the branch on which this commit happened, or "*" if not on a branch
  93    15. the number of tags rooted at this revision (followed by their
  94        names, space-delimited)
  95    16. the number of branches rooted at this revision (followed by
  96        their names, space-delimited)
  97    17. the path of the RCS file in the repository
  98
  99 (Of course, in the above example, fields 15 and 16 are "0", so they have
 100 no additional data.)
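
To make the field list concrete, here is a hedged sketch of a parser
for one revs line.  It follows the seventeen fields as listed above
(not necessarily the exact output of the CVSRevision class); RevLine
and parse_rev_line are hypothetical names, and the sketch assumes
fields are separated by single spaces.

```python
from collections import namedtuple

# Field names are illustrative; they mirror the numbered list above.
RevLine = namedtuple("RevLine", [
    "timestamp", "digest", "prev_timestamp", "next_timestamp", "op",
    "prev_rev", "rev", "next_rev", "in_attic", "executable",
    "file_size", "has_deltatext", "keyword_mode", "branch",
    "tags", "branches", "path",
])


def _optional(token):
    """Map the "*" placeholder to None."""
    return None if token == "*" else token


def parse_rev_line(line):
    tokens = line.rstrip("\n").split(" ")
    fixed, rest = tokens[:14], tokens[14:]
    # Field 15: tag count followed by that many tag names.
    n_tags = int(rest[0])
    tags = rest[1:1 + n_tags]
    rest = rest[1 + n_tags:]
    # Field 16: branch count followed by that many branch names.
    n_branches = int(rest[0])
    branches = rest[1:1 + n_branches]
    # Field 17: everything left is the path (which may contain spaces).
    path = " ".join(rest[1 + n_branches:])
    return RevLine(
        timestamp=fixed[0],
        digest=fixed[1],
        prev_timestamp=_optional(fixed[2]),
        next_timestamp=_optional(fixed[3]),
        op=fixed[4],
        prev_rev=_optional(fixed[5]),
        rev=fixed[6],
        next_rev=_optional(fixed[7]),
        in_attic=fixed[8] == "T",
        executable=fixed[9] == "T",
        file_size=int(fixed[10]),
        has_deltatext=fixed[11] == "T",
        keyword_mode=_optional(fixed[12]),
        branch=_optional(fixed[13]),
        tags=tags,
        branches=branches,
        path=path,
    )
```

Because the tag and branch names are variable-length, the path has to
be read last; that is presumably why it is the final field.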
 101
 102 Also, for resync'd revisions, a line like this is written out to
 103 'cvs2svn-data.resync':
 104
 105    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
 106
 107 The fields are:
 108
 109    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP
 110
 111 (The resync file will be explained later.)
 112
 113 That's it -- the RCS file is done.
 114
 115 When every RCS file is done, CollectRevsPass is complete, and:
 116
 117    - cvs2svn-data.revs contains a summary of every RCS file's
 118      revisions.  All the revisions for a given RCS file are grouped
 119      together, but note that the groups are in no particular order.
 120      In other words, you can't yet identify the commits from looking
 121      at these lines; a multi-file commit will be scattered all over
 122      the place.
 123
 124    - cvs2svn-data.resync contains a small amount of resync data, in
 125      no particular order.
 126
 127
 128 ResyncRevsPass (formerly called pass2)
 129 ==============
 130
 131 This is where the resync file is used.  The goal of this pass is to
 132 convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
 133 revs).  It's the same as the original file, except for some resync'd
 134 timestamps.
 135
 136 First, read the whole resync file into a hash table that maps each
 137 author+log digest to a list of lists.  Each sublist represents one of
 138 the timestamp adjustments from CollectRevsPass, and looks like this:
 139
 140    [old_time_lower, old_time_upper, new_time]
 141
 142 The reason to map each digest to a list of sublists, instead of to one
 143 list, is that sometimes you'll get the same digest for unrelated
 144 commits (for example, the same author commits many times using the
 145 empty log message, or a log message that just says "Doc tweaks.").  So
 146 each digest may need to "fan out" to cover multiple commits, but
 147 without accidentally unifying those commits.
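
The hash table described above could be built roughly as follows.
This is a sketch under assumptions: timestamps are taken to be
hexadecimal (as the sample resync line suggests), the value of
COMMIT_THRESHOLD is a placeholder, and centering the old-time window
on OLD_TIMESTAMP with half the threshold on each side is one
plausible choice, not something these notes specify.  load_resync is
a hypothetical name.

```python
from collections import defaultdict

COMMIT_THRESHOLD = 300  # seconds; placeholder value for this sketch


def load_resync(lines):
    """Map each digest to a list of [lower, upper, new_time] sublists.

    Each input line has the form: NEW_TIMESTAMP DIGEST OLD_TIMESTAMP
    """
    resync = defaultdict(list)
    half = COMMIT_THRESHOLD // 2
    for line in lines:
        new_hex, digest, old_hex = line.split()
        new_time = int(new_hex, 16)  # assumes hex timestamps
        old_time = int(old_hex, 16)
        # One sublist per adjustment, so that unrelated commits that
        # share a digest are kept apart.
        resync[digest].append([old_time - half, old_time + half, new_time])
    return resync
```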
 148
 149 Now we loop over cvs2svn-data.revs, writing each line out to
 150 'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
 151 those whose digest matches some resync entry, and that appear to be part
 152 of the same commit as one of the sublists in that entry, get tweaked.
 153 The tweak is to adjust the commit time of the line to the new_time,
 154 which is taken from the resync hash and results from the adjustment
 155 described in CollectRevsPass.
 156
 157 The way we figure out whether a given line needs to be tweaked is to
 158 loop over all the sublists, seeing if this commit's original time
 159 falls within the old<-->new time range for the current sublist.  If it
 160 does, we tweak the line before writing it out, and then conditionally
 161 adjust the sublist's range to account for the timestamp we just
 162 adjusted (since it could be an outlier).  Note that this could, in
 163 theory, result in separate commits being accidentally unified, since
 164 we might gradually adjust the two sides of the range such that they are
 165 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 166 really a case of CVS not recording enough information to disambiguate
 167 the commits; we'd know we have a time range that exceeds the
 168 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 169 up.  We could try some clever heuristic, but for now it's not
 170 important -- after all, we're talking about commits that weren't
 171 important enough to have a distinctive log message anyway, so does it
 172 really matter if a couple of them accidentally get unified?  Probably
 173 not.
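
The range check and the conditional widening described above might
look like this.  It is a sketch, not the pass's real code:
resync_time is a hypothetical name, the table has the
digest -> [[lower, upper, new_time], ...] shape described earlier,
and the COMMIT_THRESHOLD value is a placeholder.

```python
COMMIT_THRESHOLD = 300  # seconds; placeholder value for this sketch


def resync_time(resync, digest, timestamp):
    """Return the timestamp to write out for one revs line.

    If the line's original time falls inside one of the sublists for
    its digest, return that sublist's new_time and widen the sublist
    so later lines of the same commit still match.
    """
    half = COMMIT_THRESHOLD // 2
    for record in resync.get(digest, ()):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            # This timestamp may be an outlier of the commit, so let
            # the window grow around it (never shrink).
            record[0] = min(lower, timestamp - half)
            record[1] = max(upper, timestamp + half)
            return new_time
    # No matching adjustment: the line is written out unchanged.
    return timestamp
```

The widening is what makes the accidental unification mentioned above
possible: each matched line can push a bound outward, so the window
can end up wider than COMMIT_THRESHOLD.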
 174
 175
 176 SortRevsPass (formerly called pass3)
 177 ============
 178
 179 This is where we deduce the changesets, that is, the grouping of file
 180 changes into single commits.
 181
 182 It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
 183 to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
 184 this causes commits with the same digest (that is, the same author and
 185 log message) to be grouped together.  Poof!  We now have the CVS
 186 changes grouped by logical commit.
 187
 188 In some cases, the changes in a given commit may be interleaved with
 189 other commits that went on at the same time, because the sort gives
 190 precedence to date before log digest.  However, CreateDatabasesPass
 191 detects this by seeing that the log digest is different, and
 192 reseparates the commits.
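
One way to picture the reseparation of interleaved commits is the
sketch below.  It is an illustration only, not CreateDatabasesPass
itself: split_into_commits is a hypothetical name, timestamps are
assumed to be hexadecimal, and closing a commit once more than
COMMIT_THRESHOLD seconds have passed is an assumed rule.

```python
COMMIT_THRESHOLD = 300  # seconds; placeholder value for this sketch


def split_into_commits(sorted_lines):
    """Group sorted revs lines into commits keyed by log digest.

    sorted_lines must already be in 'sort' order, i.e. by timestamp
    first and digest second.
    """
    open_commits = {}  # digest -> [last_timestamp, lines]
    finished = []
    for line in sorted_lines:
        time_hex, digest = line.split(" ", 2)[:2]
        timestamp = int(time_hex, 16)  # assumes hex timestamps
        current = open_commits.get(digest)
        if current is not None and timestamp - current[0] > COMMIT_THRESHOLD:
            # Same digest, but too far apart in time: close the old
            # commit and start a new one.
            finished.append(current[1])
            current = None
        if current is None:
            current = open_commits[digest] = [timestamp, []]
        current[0] = timestamp
        current[1].append(line)
    # Whatever is still open at the end is complete too.
    finished.extend(commit[1] for commit in open_commits.values())
    return finished
```

Keeping one open commit per digest is what lets two commits made at
the same time stay separate even though their lines alternate in the
sorted file.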
 193
 194
 195 CreateDatabasesPass (formerly called pass4)
 196 ===================
 197
 198 This pass has two primary objectives:
 199
 200 1. Create a database that maps CVSRevision unique keys to the actual
 201    CVSRevision string from the revs file (whose format is described
 202    above in CollectRevsPass).  This results in a database containing
 203    one key-value pair for each line in the revs file.  This gives us
 204    the ability to pass around these smaller keys instead of whole CVS
 205    revisions (which look like lines from the s-revs file).  See the
 206    CVSRevision class for more details on what the unique key is.
 207
 208 2. Find and create a database containing the last CVS revision that is
 209    a source (also referred to as an "opening" revision) for all
 210    symbolic names.  This will result in a database containing
 211    key-value pairs whose key is the unique key for a CVSRevision, and
 212    whose value is a list of symbolic names for which that CVSRevision
 213    is the last "opening."
 214
 215    The format for this file is:
 216
 217        cvs-symname-last-revs.db:
 218             Key                      Value
 219             CVS Revision             array of Symbolic names
 220
 221        For example:
 222
 223             1.38/foo/bar/baz.txt,v  --> [TAG11, BRANCH38]
 224             1.93/foo/qux/bat.c,v    --> [TAG39]
 225             1.4/foo/bar/baz.txt,v   --> [BRANCH48, BRANCH37]
 226             1.18/foo/bar/quux.txt,v --> [TAG320, TAG1178]
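
The second objective amounts to "last writer wins" per symbolic name.
A rough sketch, assuming the input is a sequence of
(revision key, symbols opened) pairs in commit order;
last_opening_revs and that input shape are hypothetical, and a plain
dict stands in for the database.

```python
def last_opening_revs(revisions):
    """Map each CVS revision key to the symbols it is the last opening for.

    revisions: iterable of (rev_key, [symbolic names it can source]),
    in the order the revisions appear in the sorted revs file.
    """
    last_for_symbol = {}
    for rev_key, symbols in revisions:
        for symbol in symbols:
            # A later revision simply overwrites an earlier one.
            last_for_symbol[symbol] = rev_key
    by_rev = {}
    for symbol, rev_key in last_for_symbol.items():
        by_rev.setdefault(rev_key, []).append(symbol)
    return by_rev
```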
 227
 228
 229 AggregateRevsPass (formerly called pass5)
 230 =================
 231
 232 Primarily, this pass gathers CVS revisions into Subversion revisions
 233 (a Subversion revision comprises one or more CVS revisions)
 234 before we actually begin committing (where "committing" means either
 235 to a Subversion repository or to a dump file).
 236
 237 This pass does the following:
 238
 239 1. Creates a database file to map Subversion Revision numbers to their
 240    corresponding CVS Revisions (cvs2svn-svn-revnums-to-cvs-revs.db).
 241    Creates another database file to map CVS Revisions to their
 242    Subversion Revision numbers (cvs2svn-cvs-revs-to-svn-revnums.db).
 243
 244 2. When a file is copied to a symbolic name in cvs2svn, there is a
 245    range of valid Subversion revisions that we can copy the file from.
 246    The first valid Subversion revision number for a symbolic name is
 247    called the "Opening", and the first *invalid* Subversion revision
 248    number encountered after the "Opening" is called the "Closing".  In
 249    this pass, the SymbolingsLogger class writes one line to
 250    cvs2svn-symbolic-names.txt per CVS file, per symbolic name, per
 251    opening or closing.
 252
 253 3. For each CVS Revision in s-revs, we write out a line (for each
 254    symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
 255    the first possible source revision (the "opening" revision) for a
 256    copy to create a branch or tag, or if it is the last possible
 257    revision (the "closing" revision) for a copy to create a branch or
 258    tag.  Not every opening will have a corresponding closing.
 259
 260    The format of each line is:
 261
 262        SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_PATH
 263
 264    For example:
 265
 266        MY_TAG1 234 O * foo/bar/baz.txt,v
 267        MY_BRANCH3 245 O * foo/qux/bat.c,v
 268        MY_TAG1 241 C MY_BRANCH1 foo/bar/baz.txt,v
 269        MY_BRANCH_BLAH 201 O MY_BRANCH1 foo/bar/quux.txt,v
 270
 271    Here is what the columns mean:
 272
 273    SYMBOLIC_NAME: The name of the branch or tag that starts or ends
 274                   in this CVS Revision (there can be multiples per
 275                   CVS rev).
 276
 277    SVN_REVNUM: The Subversion revision number that is the opening or
 278                closing for this SYMBOLIC_NAME.
 279
 280    TYPE: "O" for Openings and "C" for Closings.
 281
 282    BRANCH_NAME: The (uncleaned) branch name where this opening or
 283                 closing happened.  '*' denotes the default branch.
 284
 285    CVS_PATH: The CVS path where this opening or closing happened.
 286
 287    See SymbolingsLogger for more details.
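
For reference, a line in this format can be taken apart as sketched
below.  Symboling and parse_symboling are hypothetical names, not
part of SymbolingsLogger; the sketch assumes single-space separators
and lets the path be the remainder of the line.

```python
from collections import namedtuple

Symboling = namedtuple(
    "Symboling", ["name", "svn_revnum", "type", "branch_name", "cvs_path"])


def parse_symboling(line):
    """Parse 'SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_PATH'."""
    name, revnum, kind, branch, path = line.rstrip("\n").split(" ", 4)
    if kind not in ("O", "C"):
        raise ValueError("unknown TYPE %r" % kind)
    # "*" is the placeholder used in the BRANCH_NAME column.
    branch_name = None if branch == "*" else branch
    return Symboling(name, int(revnum), kind, branch_name, path)
```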
 288
 289
 290 SortSymbolsPass (formerly called pass6)
 291 ===============
 292
 293 This pass merely sorts cvs2svn-symbolic-names.txt into
 294 cvs2svn-symbolic-names-s.txt.  This orders the file first by symbolic
 295 name, and second by Subversion revision number, thus grouping all
 296 openings and closings for each symbolic name together.
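
The ordering can be sketched in a few lines.  Note one assumption:
this sketch compares revision numbers numerically, which a plain
textual sort would only do if the numbers were written with a fixed
width.  sort_symbolings is a hypothetical name.

```python
def sort_symbolings(lines):
    """Order lines by symbolic name, then by Subversion revision number."""
    def key(line):
        name, revnum = line.split(" ", 2)[:2]
        return (name, int(revnum))
    return sorted(lines, key=key)
```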
 297
 298
 299 IndexSymbolsPass (formerly called pass7)
 300 ================
 301
 302 This pass iterates through all the lines in
 303 cvs2svn-symbolic-names-s.txt, writing out a database file mapping
 304 SYMBOLIC_NAME to the file offset in SYMBOL_OPENINGS_CLOSINGS_SORTED
 305 where SYMBOLIC_NAME is first encountered.  This will allow us to seek
 306 to the various offsets in the file and sequentially read only the
 307 openings and closings that we need.
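
A sketch of the offset index and of how it would be used afterwards.
Both function names are hypothetical, a dict stands in for the
database file, and the sorted file is assumed to be opened in binary
mode so that tell() and seek() work in plain byte offsets.

```python
def index_symbol_offsets(f):
    """Map each symbolic name to the offset of its first line in f."""
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        name = line.split(b" ", 1)[0]
        # Only the first occurrence matters; the file is sorted, so
        # all lines for one name are contiguous.
        offsets.setdefault(name, offset)
    return offsets


def read_symbol_lines(f, offsets, name):
    """Seek to a symbol's first line and read only that symbol's lines."""
    f.seek(offsets[name])
    lines = []
    for line in iter(f.readline, b""):
        if line.split(b" ", 1)[0] != name:
            break
        lines.append(line)
    return lines
```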
 308
 309
 310 OutputPass (formerly called pass8)
 311 ==========
 312
 313 This pass has very little "thinking" to do--it basically opens
 314 cvs2svn-svn-revnums-to-cvs-revs.db and, starting with Subversion
 315 revision 2 (revision 1 creates /trunk, /tags, and /branches),
 316 sequentially plays out all the commits to either a Subversion
 317 repository or to a dumpfile.
 318
 319 In --dump-only mode, the result of this pass is a Subversion
 320 repository dumpfile (suitable for input to 'svnadmin load').  The
 321 dumpfile is the data's last static stage: last chance to check over
 322 the data, run it through svndumpfilter, move the dumpfile to another
 323 machine, etc.
 324
 325 However, when not in --dump-only mode, no full dumpfile is created for
 326 subsequent load into a Subversion repository.  Instead, miniature
 327 dumpfiles representing a single revision are created, loaded into the
 328 repository, and then removed.
 329
 330 In both modes, the dumpfile revisions are created by walking through
 331 cvs2svn-data.s-revs.
 332
 333
 334                   ===============================
 335                       Branches and Tags Plan.
 336                   ===============================
 337
 338 This pass is also where tag and branch creation is done.  Since
 339 Subversion does tags and branches by copying from existing revisions
 340 (then maybe editing the copy, making subcopies underneath, etc), the
 341 big question for cvs2svn is how to achieve the minimum number of
 342 operations per creation.  For example, if it's possible to get the
 343 right tag by just copying revision 53, then it's better to do that
 344 than, say, copying revision 51 and then sub-copying in bits of
 345 revision 52 and 53.
 346
 347 Also, since CVS does not version symbolic names, there is the
 348 secondary question of *when* to create a particular tag or branch.
 349 For example, a tag might have been made at any time after the youngest
 350 commit included in it, or might even have been made piecemeal; and the
 351 same is true for a branch, with the added constraint that for any
 352 particular file, the branch must have been created before the first
 353 commit on the branch.
 354
 355 Answering the second question first: cvs2svn creates tags as soon as
 356 possible and branches as late as possible.
 357
 358 Tags are created as soon as cvs2svn encounters the last CVS Revision
 359 that is a source for that tag.  The whole tag is created in one
 360 Subversion commit.
 361
 362 For branches, this is "just in time" creation -- the moment it sees
 363 the first commit on a branch, it snaps the entire branch into
 364 existence (or as much of it as possible), and then outputs the branch
 365 commit.
 366
 367 The reason we say "as much of it as possible" is that it's possible to
 368 have a branch where some files have branch commits occurring earlier
 369 than the other files even have the source revisions from which the
 370 branch sprouts (this can happen if the branch was created piecemeal,
 371 for example).  In this case, we create as much of the branch as we
 372 can, that is, as much of it as there are source revisions available to
 373 copy, and leave the rest for later.  "Later" might mean just until
 374 other branch commits come in, or else during a cleanup stage that
 375 happens at the end of this pass (about which more later).
 376
 377 How just-in-time branch creation works:
 378
 379 In order to make the "best" set of copies/deletes when creating a
 380 branch, cvs2svn keeps track of two sets of trees while it's making
 381 commits:
 382
 383    1. A skeleton mirror of the Subversion repository, that is, an
 384       array of revisions, with a tree hanging off each revision.  (The
 385       "array" is actually implemented as an anydbm database itself,
 386       mapping string representations of numbers to root keys.)
 387
 388    2. A tree for each CVS symbolic name, and the svn file/directory
 389       revisions from which various parts of that tree could be copied.
 390
 391 Both tree sets live in anydbm databases, using the same basic schema:
 392 unique keys map to marshal.dumps() representations of dictionaries,
 393 which in turn map entry names to other unique keys:
 394
 395    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 396    entrykey1 ==> { entrynameX : entrykeyX, ... }
 397    entrykey2 ==> { entrynameY : entrykeyY, ... }
 398    entrykeyX ==> { etc, etc ...}
 399    entrykeyY ==> { etc, etc ...}
 400
 401 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
 402
 403 The repository mirror allows cvs2svn to remember what paths exist in
 404 what revisions.
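
The key/dictionary schema can be illustrated with a small sketch.
Assumptions: an ordinary dict stands in for the anydbm database, keys
are generated from a simple counter, and nothing here handles
per-revision copies of nodes.  TreeStore and its methods are
hypothetical names.

```python
import marshal


class TreeStore:
    """Nodes are marshal'd dicts mapping entry names to node keys."""

    def __init__(self):
        self.db = {}  # stand-in for an anydbm database
        self._next_key = 0

    def new_node(self, entries=None):
        key = str(self._next_key)
        self._next_key += 1
        self.db[key] = marshal.dumps(entries or {})
        return key

    def entries(self, key):
        return marshal.loads(self.db[key])

    def add_path(self, root_key, path):
        """Create any missing nodes along path; return the leaf key."""
        key = root_key
        for name in path.split("/"):
            entries = self.entries(key)
            if name not in entries:
                # Leaf nodes (files) are empty dictionaries too.
                entries[name] = self.new_node()
                self.db[key] = marshal.dumps(entries)
            key = entries[name]
        return key

    def path_exists(self, root_key, path):
        key = root_key
        for name in path.split("/"):
            entries = self.entries(key)
            if name not in entries:
                return False
            key = entries[name]
        return True
```

With one root key stored per revision number, a lookup such as
path_exists(root_key, "trunk/foo") answers whether a path existed in
that revision.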
 405
 406 For details on how branches and tags are created, please see the
 407 docstring of the SymbolingsLogger class (and its methods).
 408
 409 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 410 - -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
 411 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 412
 413 Some older notes and ideas about cvs2svn.  Not deleted, because they
 414 may contain suggestions for future improvements in design.
 415
 416 -----------------------------------------------------------------------
 417
 418 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 419 considerations for the tool.
 420
 421 ------
 422 From: John Gardiner Myers <jgmyers@speakeasy.net>
 423 Subject: Thoughts on CVS to SVN conversion
 424 To: gstein@lyra.org
 425 Date: Sun, 15 Apr 2001 17:47:10 -0700
 426
 427 Some things you may want to consider for a CVS to SVN conversion utility:
 428
 429 If converting a CVS repository to SVN takes days, it would be good for
 430 the conversion utility to keep its progress state on disk.  If the
 431 conversion fails halfway through due to a network outage or power
 432 failure, that would allow the conversion to be resumed where it left off
 433 instead of having to start over from an empty SVN repository.
 434
 435 It is a short step from there to allowing periodic updates of a
 436 read-only SVN repository from a read/write CVS repository.  This allows
 437 the more relaxed conversion procedure:
 438
 439 1) Create SVN repository writable only by the conversion tool.
 440 2) Update SVN repository from CVS repository.
 441 3) Announce the time of CVS to SVN cutover.
 442 4) Repeat step (2) as needed.
 443 5) Disable commits to CVS repository, making it read-only.
 444 6) Repeat step (2).
 445 7) Enable commits to SVN repository.
 446 8) Wait for developers to move their workspaces to SVN.
 447 9) Decommission the CVS repository.
 448
 449 You may forward this message or parts of it as you see fit.
 450 ------
 451
 452 -----------------------------------------------------------------------
 453
 454 Further design thoughts from Greg Stein <gstein@lyra.org>
 455
 456 * timestamp the beginning of the process. ignore any commits that
 457   occur after that timestamp; otherwise, you could miss portions of a
 458   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 459   revision for items in B; we missed A)
 460
 461 * the above timestamp can also be used for John's "grab any updates
 462   that were missed in the previous pass."
 463
 464 * for each file processed, watch out for simultaneous commits. this
 465   may cause a problem during the reading/scanning/parsing of the file,
 466   or the parse succeeds but the results are garbled. this could be
 467   fixed with a CVS lock, but I'd prefer read-only access.
 468
 469   algorithm: get the mtime before opening the file. if an error occurs
 470   during reading, and the mtime has changed, then restart the file. if
 471   the read is successful, but the mtime changed, then restart the
 472   file.
 473
 474 * use a separate log to track unique branches and non-branched forks
 475   of revision history (Q: is it possible to create, say, 1.4.1.3
 476   without a "real" branch?). this log can then be used to create a
 477   /branches/ directory in the SVN repository.
 478
 479   Note: we want to determine some way to coalesce branches across
 480   files. It can't be based on name, though, since the same branch name
 481   could be used in multiple places, yet they are semantically
 482   different branches. Given files R, S, and T with branch B, we can
 483   tie those files' branch B into a "semantic group" whenever we see
 484   commit groups on a branch touching multiple files. Files that
 485   have a (named) branch but no commits on it are simply ignored. For
 486   each "semantic group" of a branch, we'd create a branch based on
 487   their common ancestor, then make the changes on the children as
 488   necessary. For single-file commits to a branch, we could use
 489   heuristics (pathname analysis) to add these to a group (and log what
 490   we did), or we could put them in a "reject" kind of file for a human
 491   to tell us what to do (the human would edit a config file of some
 492   kind to instruct the converter).
 493
 494 * if we have access to the CVSROOT/history, then we could process tags
 495   properly. otherwise, we can only use heuristics or configuration
 496   info to group up tags (branches can use commits; there are no
 497   commits associated with tags)
 498
 499 * ideally, we store every bit of data from the ,v files to enable a
 500   complete restoration of the CVS repository. this could be done by
 501   storing properties with CVS revision numbers and stuff (i.e. all
 502   metadata not already embodied by SVN would go into properties)
 503
 504 * how do we track the "states"? I presume "dead" is simply deleting
 505   the entry from SVN. what are the other legal states, and do we need
 506   to do anything with them?
 507
 508 * where do we put the "description"? how about locks, access list,
 509   keyword flags, etc.
 510
 511 * note that using something like the SourceForge repository will be an
 512   ideal test case. people *move* their repositories there, which means
 513   that all kinds of stuff can be found in those repositories, from
 514   wherever people used to run them, and under whatever development
 515   policies may have been used.
 516
 517   For example: I found one of the projects with a "permissions 644;"
 518   line in the "gnuplot" repository.  Most RCS releases issue warnings
 519   about that (although they properly handle/skip the lines), and CVS
 520   ignores RCS newphrases altogether.
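
The mtime-based restart algorithm from the notes above (the item on
simultaneous commits) can be sketched as follows.  This is an
illustration, not code from cvs2svn: read_stable, the parse callback,
and the max_attempts limit are all hypothetical.

```python
import os


def read_stable(path, parse, max_attempts=5):
    """Parse a file, restarting if it changes underneath us.

    parse is any callable that takes the file's bytes and returns a
    result (or raises on malformed input).
    """
    for _ in range(max_attempts):
        mtime_before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                result = parse(f.read())
        except Exception:
            if os.stat(path).st_mtime_ns != mtime_before:
                # The error may be due to a concurrent commit: retry.
                continue
            # The file did not change, so the error is genuine.
            raise
        if os.stat(path).st_mtime_ns == mtime_before:
            return result
        # Read succeeded but the file changed meanwhile: retry.
    raise RuntimeError("%s kept changing while being read" % path)
```

The attempt limit is an addition of this sketch; the notes only say
"restart the file", which could loop forever on a very busy file.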