design-notes.txt

   1                          How cvs2svn Works
   2                          =================
   3
   4 A cvs2svn run consists of eight passes.  Each pass saves the data it
   5 produces to files on disk, so that a) we don't hold huge amounts of
   6 state in memory, and b) the conversion process is resumable.
   7
   8 Pass 1:
   9 =======
  10
  11 The goal of this pass is to write to 'cvs2svn-data.revs' a summary of
  12 all the revisions for each RCS file.  Each revision will be
  13 represented by one line.  At the end of this stage, the revisions
  14 (i.e., the lines) will be grouped by RCS file, not by logical commits.
  15
  16 We walk over the repository, processing each RCS file with
  17 rcsparse.parse(), using cvs2svn's CollectData class, which is a
  18 subclass of rcsparse.Sink(), the parser's callback class.  For each
  19 RCS file, the first thing the parser encounters is the administrative
  20 header, including the head revision, the principal branch, symbolic
  21 names, RCS comments, etc.  The main thing that happens here is that
  22 CollectData.define_tag() is invoked on each symbolic name and its
  23 attached revision, so all the tags and branches of this file get
  24 collected.
  25
  26 Next, the parser hits the revision summary section.  That's the part
  27 of the RCS file that looks like this:
  28
  29    1.6
  30    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
  31    branches
  32         1.6.2.1;
  33    next 1.5;
  34
  35    1.5
  36    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
  37    branches;
  38    next 1.4;
  39
  40    [...]
  41
  42 For each revision summary, CollectData.define_revision() is invoked,
  43 recording that revision's metadata in various variables of the
  44 CollectData class instance.
  45
  46 After finishing the revision summaries, the parser invokes
  47 CollectData.tree_completed(), which loops over the revision
  48 information stored, determining if there are instances where a higher
  49 revision was committed "before" a lower one (rare, but it can happen
  50 when there was clock skew on the repository machine).  If there are
  51 any, it "resyncs" the timestamp of the earlier rev to be just before
  52 that of the later rev, but saves the original timestamp in
  53 self.rev_data[blah][2], so we can later write out a record to the
  54 resync file indicating that an adjustment was made (this makes it
  55 possible to catch the other parts of this commit and resync them
  56 similarly, more details below).
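
As a rough illustration of the resync step described above (a sketch, not
the actual CollectData.tree_completed() code), the following assumes a
hypothetical list of (revision, timestamp) pairs for one line of
development, oldest revision first.  The function name and the one-second
offset are assumptions made for the example.

```python
def resync_timestamps(revs):
    """revs: list of (revision, timestamp) pairs, oldest revision first.

    Returns (fixed, adjustments).  'adjustments' maps each resynced
    revision to its original timestamp, so that a resync record could be
    written out later.
    """
    fixed = list(revs)
    adjustments = {}
    # Walk from the newest revision back towards the oldest.
    for i in range(len(fixed) - 2, -1, -1):
        rev, ts = fixed[i]
        next_ts = fixed[i + 1][1]
        if ts >= next_ts:
            # The earlier revision claims to be no older than the later
            # one: move it to just before the later revision.
            adjustments[rev] = ts
            fixed[i] = (rev, next_ts - 1)
    return fixed, adjustments


# Revision 1.2 carries a timestamp later than 1.3 (clock skew).
revs = [("1.1", 100), ("1.2", 250), ("1.3", 200)]
fixed, adjustments = resync_timestamps(revs)
```

Only revision 1.2 is touched here; its original timestamp is kept in
'adjustments' for the resync record.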
  57
  58 Next, the parser encounters the *real* revision data, which has the
  59 log messages and file contents.  For each revision, it invokes
  60 CollectData.set_revision_info(), which writes a new line to
  61 cvs2svn-data.revs.  The line is constructed by the CVSRevision class -
  62 one of its many roles. Here is an example:
  63
  64    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 3dc32954 3dc32956 C 1.1 1.2 1.3 1 1 1024 N * 0 0 foo/bar,v
  65
  66 The fields are:
  67
  68    1.  a fixed-width timestamp
  69    2.  a digest of the log message + author
  70    3.  a fixed-width timestamp indicating the timestamp of this
  71        revision's previous revision (or "*", if it's the first
  72        revision on this line of development).
  73    4.  a fixed-width timestamp indicating the timestamp of this
  74        revision's next revision (or "*", if it's the last revision on
  75        this line of development).
  76    5.  the type of change ("A"dd, "C"hange, or "D"elete)
  77    6.  the revision number of the previous revision along this line of
  78        development (or "*", if it's the first revision on this line of
  79        development).
  80    7.  the revision number
  81    8.  the revision number of the next revision along this line of
  82        development (or "*", if it's the last revision on this line of
  83        development).
  84    9.  1 if the RCS file is in the Attic, "*" if it isn't.
  85    10. 1 if the RCS file has the executable bit set, "*" if not.
   86    11. The size of the RCS file, in bytes.
  87    12. "N" if this revision has non-empty deltatext, else "E" for empty
  88    13. the RCS keyword substitution mode ("k", "b", etc), or "*" if none
  89    14. the branch on which this commit happened, or "*" if not on a branch
  90    15. the number of tags rooted at this revision (followed by their
  91        names, space-delimited)
  92    16. the number of branches rooted at this revision (followed by
  93        their names, space-delimited)
  94    17. the path of the RCS file in the repository
  95
  96 (Of course, in the above example, fields 15 and 16 are "0", so they have
  97 no additional data.)
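
The field list above can be turned into a small parser.  This is a sketch
under stated assumptions: the dictionary key names are invented for the
example, the timestamps are assumed to be hexadecimal seconds, and the
sample line is hypothetical (it was written to contain all 17 fields,
with one tag and two branches).

```python
def parse_revs_line(line):
    """Split one revs line into a dict, following the 17 fields above."""
    tokens = line.rstrip("\n").split(" ")
    # Fields 1-14 are single tokens; these key names are illustrative.
    names = ["timestamp", "digest", "prev_timestamp", "next_timestamp",
             "op", "prev_rev", "rev", "next_rev", "in_attic", "executable",
             "file_size", "deltatext", "keyword_mode", "branch"]
    rec = dict(zip(names, tokens))
    rec["timestamp"] = int(rec["timestamp"], 16)  # assumed to be hex
    rec["file_size"] = int(rec["file_size"])
    pos = len(names)
    # Field 15: tag count followed by that many tag names.
    ntags = int(tokens[pos])
    rec["tags"] = tokens[pos + 1:pos + 1 + ntags]
    pos += 1 + ntags
    # Field 16: branch count followed by that many branch names.
    nbranches = int(tokens[pos])
    rec["branches"] = tokens[pos + 1:pos + 1 + nbranches]
    pos += 1 + nbranches
    # Field 17: everything that remains is the RCS file path.
    rec["path"] = " ".join(tokens[pos:])
    return rec


sample = ("3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 3dc32954 * "
          "C 1.1 1.2 * * 1 1024 N k * 1 REL_1 2 B1 B2 foo/bar,v")
rec = parse_revs_line(sample)
```

Putting the path last means it can be recovered even if it contains
spaces, since the two counts say exactly how many names precede it.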
  98
  99 Also, for resync'd revisions, a line like this is written out to
 100 'cvs2svn-data.resync':
 101
 102    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
 103
 104 The fields are:
 105
 106    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP
 107
 108 (The resync file will be explained later.)
 109
 110 That's it -- the RCS file is done.
 111
 112 When every RCS file is done, Pass 1 is complete, and:
 113
 114    - cvs2svn-data.revs contains a summary of every RCS file's
 115      revisions.  All the revisions for a given RCS file are grouped
 116      together, but note that the groups are in no particular order.
 117      In other words, you can't yet identify the commits from looking
 118      at these lines; a multi-file commit will be scattered all over
 119      the place.
 120
 121    - cvs2svn-data.resync contains a small amount of resync data, in
 122      no particular order.
 123
 124 Pass 2:
 125 =======
 126
 127 This is where the resync file is used.  The goal of this pass is to
 128 convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
 129 revs).  It's the same as the original file, except for some resync'd
 130 timestamps.
 131
 132 First, read the whole resync file into a hash table that maps each
 133 author+log digest to a list of lists.  Each sublist represents one of
 134 the timestamp adjustments from Pass 1, and looks like this:
 135
 136    [old_time_lower, old_time_upper, new_time]
 137
 138 The reason to map each digest to a list of sublists, instead of to one
 139 list, is that sometimes you'll get the same digest for unrelated
 140 commits (for example, the same author commits many times using the
 141 empty log message, or a log message that just says "Doc tweaks.").  So
 142 each digest may need to "fan out" to cover multiple commits, but
 143 without accidentally unifying those commits.
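
A minimal sketch of building that hash table, assuming the resync line
layout given in Pass 1 (NEW_TIMESTAMP DIGEST OLD_TIMESTAMP, in hex).  The
value of COMMIT_THRESHOLD and the choice to open a window of half the
threshold on either side of the old timestamp are assumptions made for
illustration.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def read_resync(lines):
    """Map digest -> [[old_time_lower, old_time_upper, new_time], ...]."""
    resync = {}
    for line in lines:
        new_hex, digest, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        # Assumed window: half the threshold either side of the old time.
        lower = old_time - COMMIT_THRESHOLD // 2
        upper = old_time + COMMIT_THRESHOLD // 2
        # Several unrelated commits may share a digest, so each digest
        # keeps a list of independent sublists.
        resync.setdefault(digest, []).append([lower, upper, new_time])
    return resync


digest = "18a215a05abea1c6c155dcc7283b88ae7ce23502"
table = read_resync([
    "3d6c1329 %s 3d6c1328" % digest,
    "3d700001 %s 3d700000" % digest,  # hypothetical second adjustment
])
```

Both lines share a digest but land in separate sublists, which is the
"fan out" described above.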
 144
 145 Now we loop over cvs2svn-data.revs, writing each line out to
 146 'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
  147 those whose digest matches some resync entry, and that appear to be part of
 148 the same commit as one of the sublists in that entry, get tweaked.
 149 The tweak is to adjust the commit time of the line to the new_time,
 150 which is taken from the resync hash and results from the adjustment
 151 described in Pass 1.
 152
 153 The way we figure out whether a given line needs to be tweaked is to
 154 loop over all the sublists, seeing if this commit's original time
 155 falls within the old<-->new time range for the current sublist.  If it
 156 does, we tweak the line before writing it out, and then conditionally
 157 adjust the sublist's range to account for the timestamp we just
 158 adjusted (since it could be an outlier).  Note that this could, in
 159 theory, result in separate commits being accidentally unified, since
 160 we might gradually adjust the two sides of the range such that they are
 161 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 162 really a case of CVS not recording enough information to disambiguate
 163 the commits; we'd know we have a time range that exceeds the
 164 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 165 up.  We could try some clever heuristic, but for now it's not
 166 important -- after all, we're talking about commits that weren't
 167 important enough to have a distinctive log message anyway, so does it
 168 really matter if a couple of them accidentally get unified?  Probably
 169 not.
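
The matching and range adjustment might look roughly like this.  It is a
sketch, not the code cvs2svn runs: the function name, the threshold value
and the exact widening rule (half the threshold around the matched
timestamp) are assumptions.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def resync_time(resync, digest, timestamp):
    """Return the (possibly adjusted) commit time for one revs line.

    resync maps digest -> [[old_time_lower, old_time_upper, new_time], ...].
    """
    for record in resync.get(digest, []):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            # Widen the window so that further revisions belonging to the
            # same commit, slightly beyond the old bounds, still match.
            record[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            record[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp  # no sublist matched: the line is left unchanged


resync = {"d": [[850, 1150, 2000]]}
first = resync_time(resync, "d", 1100)     # inside the window
second = resync_time(resync, "d", 1240)    # inside only after widening
unrelated = resync_time(resync, "d", 5000) # far outside any window
```

The second call matches only because the first call stretched the upper
bound, which is also how two separate commits could end up unified.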
 170
 171 Pass 3:
 172 =======
 173
 174 This is where we deduce the changesets, that is, the grouping of file
 175 changes into single commits.
 176
 177 It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
 178 to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
 179 this causes commits with the same digest (that is, the same author and
 180 log message) to be grouped together.  Poof!  We now have the CVS
 181 changes grouped by logical commit.
 182
 183 In some cases, the changes in a given commit may be interleaved with
 184 other commits that went on at the same time, because the sort gives
 185 precedence to date before log digest.  However, Pass 4 detects this by
 186 seeing that the log digest is different, and reseparates the commits.
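
To illustrate why a plain sort is enough, and how interleaved commits can
be pulled apart again, here is a sketch that uses simplified three-field
lines (timestamp, digest, path) rather than the full revs format.  The
grouping rule and the COMMIT_THRESHOLD value are assumptions for the
example, not a description of what Pass 4 actually does.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value for this sketch


def group_commits(lines):
    """Group simplified 'TIMESTAMP DIGEST PATH' lines into commits."""
    commits = []       # each commit is a list of paths
    open_commits = {}  # digest -> (last timestamp seen, commit list)
    # A plain string sort orders by time because the leading timestamp
    # is fixed-width.
    for line in sorted(lines):
        ts_hex, digest, path = line.split(" ", 2)
        ts = int(ts_hex, 16)
        entry = open_commits.get(digest)
        if entry is None or ts - entry[0] > COMMIT_THRESHOLD:
            commit = []          # new digest, or too long a gap
            commits.append(commit)
        else:
            commit = entry[1]    # continue the commit with this digest
        commit.append(path)
        open_commits[digest] = (ts, commit)
    return commits


commits = group_commits([
    "0000000c aaaa b.txt",
    "0000000a aaaa a.txt",
    "0000000b bbbb x.txt",   # interleaved commit with another digest
    "00001000 aaaa c.txt",   # same digest, but much later
])
```

The line with digest bbbb sorts between the two aaaa lines, yet the
differing digest keeps it in its own commit.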
 187
 188 Pass 4:
 189 =======
 190
 191 This pass has two primary objectives:
 192
 193 1. Create a database that maps CVSRevision unique keys to the actual
 194    CVSRevision string from the revs file (whose format is described
 195    above in pass 1).  This results in a database containing one
 196    key-value pair for each line in the revs file.  This gives us the
 197    ability to pass around these smaller keys instead of whole CVS
 198    revisions (which look like lines from the s-revs file).  See the
 199    CVSRevision class for more details on what the unique key is.
 200
 201 2. Find and create a database containing the last CVS revision that is
 202    a source (also referred to as an "opening" revision) for all
 203    symbolic names.  This will result in a database containing
 204    key-value pairs whose key is the unique key for a CVSRevision, and
 205    whose value is a list of symbolic names for which that CVSRevision
 206    is the last "opening."
 207
 208    The format for this file is:
 209
 210        cvs-symname-last-revs.db:
 211             Key                      Value
 212             CVS Revision             array of Symbolic names
 213
 214        For example:
 215
 216             1.38/foo/bar/baz.txt,v  --> [TAG11, BRANCH38]
 217             1.93/foo/qux/bat.c,v    --> [TAG39]
 218             1.4/foo/bar/baz.txt,v   --> [BRANCH48, BRANCH37]
 219             1.18/foo/bar/quux.txt,v --> [TAG320, TAG1178]
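
Both objectives can be sketched with ordinary dictionaries standing in
for the database files.  The input tuples, the function name and the
"REV/PATH" key format (modelled on the example keys above) are
assumptions; see the CVSRevision class for the real unique key.

```python
def build_pass4_tables(revs):
    """revs: iterable of (rev, path, symbols_opened), in s-revs order.

    Returns (revs_db, last_revs): revs_db maps a unique key to its
    record, and last_revs maps a unique key to the symbolic names for
    which that revision is the last opening.
    """
    revs_db = {}
    last_opening = {}  # symbolic name -> key of the last opening seen
    for rev, path, symbols in revs:
        key = "%s/%s" % (rev, path)
        revs_db[key] = (rev, path, symbols)
        for name in symbols:
            # Later revisions overwrite earlier ones, so only the last
            # opening for each name survives.
            last_opening[name] = key
    last_revs = {}
    for name, key in last_opening.items():
        last_revs.setdefault(key, []).append(name)
    return revs_db, last_revs


revs_db, last_revs = build_pass4_tables([
    ("1.3", "foo/a.txt,v", ["TAG1"]),
    ("1.2", "foo/b.c,v", ["TAG1", "BR1"]),
    ("1.4", "foo/a.txt,v", []),
])
```

Here TAG1 is opened by two revisions, but only the later one appears in
last_revs.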
 220
 221 Pass 5:
 222 =======
 223
 224 Primarily, this pass gathers CVS revisions into Subversion revisions
  225 (a Subversion revision comprises one or more CVS revisions)
 226 before we actually begin committing (where "committing" means either
 227 to a Subversion repository or to a dump file).
 228
 229 This pass does the following:
 230
 231 1. Creates a database file to map Subversion Revision numbers to their
 232    corresponding CVS Revisions (cvs2svn-svn-revnums-to-cvs-revs.db).
 233    Creates another database file to map CVS Revisions to their
 234    Subversion Revision numbers (cvs2svn-cvs-revs-to-svn-revnums.db).
 235
  236 2. When a file is copied to a symbolic name in cvs2svn, there is a
 237    range of valid Subversion revisions that we can copy the file from.
 238    The first valid Subversion revision number for a symbolic name is
 239    called the "Opening", and the first *invalid* Subversion revision
 240    number encountered after the "Opening" is called the "Closing".  In
 241    this pass, the SymbolingsLogger class writes one line to
 242    cvs2svn-symbolic-names.txt per CVS file, per symbolic name, per
 243    opening or closing.
 244
 245 3. For each CVS Revision in s-revs, we write out a line (for each
 246    symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
 247    the first possible source revision (the "opening" revision) for a
 248    copy to create a branch or tag, or if it is the last possible
 249    revision (the "closing" revision) for a copy to create a branch or
 250    tag.  Not every opening will have a corresponding closing.
 251
 252    The format of each line is:
 253
 254        SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_PATH
 255
 256    For example:
 257
 258        MY_TAG1 234 O * foo/bar/baz.txt,v
 259        MY_BRANCH3 245 O * foo/qux/bat.c,v
 260        MY_TAG1 241 C MY_BRANCH1 foo/bar/baz.txt,v
 261        MY_BRANCH_BLAH 201 O MY_BRANCH1 foo/bar/quux.txt,v
 262
 263    Here is what the columns mean:
 264
 265    SYMBOLIC_NAME: The name of the branch or tag that starts or ends
 266                   in this CVS Revision (there can be multiples per
 267                   CVS rev).
 268
 269    SVN_REVNUM: The Subversion revision number that is the opening or
 270                closing for this SYMBOLIC_NAME.
 271
 272    TYPE: "O" for Openings and "C" for Closings.
 273
 274    BRANCH_NAME: The (uncleaned) branch name where this opening or
 275                 closing happened.  '*' denotes the default branch.
 276
 277    CVS_PATH: The CVS path where this opening or closing happened.
 278
 279    See SymbolingsLogger for more details.
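
A short sketch of reading these lines back and pairing openings with
closings.  The Symboling tuple and both function names are invented for
the example; the input lines are the ones shown above.

```python
from collections import namedtuple

Symboling = namedtuple("Symboling", "name revnum type branch path")


def parse_symboling(line):
    """Parse 'SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_PATH'."""
    name, revnum, type_, branch, path = line.rstrip("\n").split(" ", 4)
    return Symboling(name, int(revnum), type_, branch, path)


def source_ranges(lines):
    """Map (symbolic name, path) -> [opening revnum, closing revnum].

    A closing of None means no closing was logged for that opening.
    """
    ranges = {}
    for s in map(parse_symboling, lines):
        rng = ranges.setdefault((s.name, s.path), [None, None])
        rng[0 if s.type == "O" else 1] = s.revnum
    return ranges


ranges = source_ranges([
    "MY_TAG1 234 O * foo/bar/baz.txt,v",
    "MY_BRANCH3 245 O * foo/qux/bat.c,v",
    "MY_TAG1 241 C MY_BRANCH1 foo/bar/baz.txt,v",
    "MY_BRANCH_BLAH 201 O MY_BRANCH1 foo/bar/quux.txt,v",
])
```

For MY_TAG1 on foo/bar/baz.txt,v this gives the range 234 up to (but not
including) 241 as valid copy sources.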
 280
 281 Pass 6:
 282 =======
 283
 284 This pass merely sorts cvs2svn-symbolic-names.txt into
 285 cvs2svn-symbolic-names-s.txt.  This orders the file first by symbolic
 286 name, and second by Subversion revision number, thus grouping all
 287 openings and closings for each symbolic name together.
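
The ordering can be sketched in a few lines.  This assumes that the
revision number is meant to be compared numerically; a plain text sort
would place 1000 before 234 unless the numbers were written fixed-width.
The line with revision 1000 is hypothetical and is there only to show the
difference.

```python
def sort_symbolings(lines):
    """Order lines by symbolic name, then by Subversion revision number."""
    def key(line):
        name, revnum = line.split(" ", 2)[:2]
        return (name, int(revnum))
    return sorted(lines, key=key)


ordered = sort_symbolings([
    "MY_TAG1 234 O * foo/bar/baz.txt,v",
    "MY_BRANCH3 245 O * foo/qux/bat.c,v",
    "MY_TAG1 1000 C * foo/other.txt,v",  # hypothetical line
    "MY_TAG1 241 C MY_BRANCH1 foo/bar/baz.txt,v",
    "MY_BRANCH_BLAH 201 O MY_BRANCH1 foo/bar/quux.txt,v",
])
```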
 288
 289 Pass 7:
 290 =======
 291
  292 This pass iterates through all the lines in
  293 cvs2svn-symbolic-names-s.txt (that is,
  294 SYMBOL_OPENINGS_CLOSINGS_SORTED), writing out a database file mapping
  295 each SYMBOLIC_NAME to the file offset where that name is first
  296 encountered.  This will allow us to seek to the various offsets in the
  297 file and sequentially read only the openings and closings that we need.
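
A sketch of the offset index, using an in-memory io.BytesIO as a stand-in
for the sorted file and a plain dict as a stand-in for the database.  The
function names are invented for the example.

```python
import io


def build_offsets(f):
    """Map each symbolic name to the offset of its first line in f."""
    offsets = {}
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        name = line.split(b" ", 1)[0].decode()
        offsets.setdefault(name, pos)  # keep only the first occurrence
    return offsets


def read_symbol(f, offsets, name):
    """Seek to the first line for 'name' and read only its lines."""
    f.seek(offsets[name])
    result = []
    for line in iter(f.readline, b""):
        if line.split(b" ", 1)[0].decode() != name:
            break  # the file is sorted, so the group has ended
        result.append(line.decode().rstrip("\n"))
    return result


sorted_file = io.BytesIO(
    b"MY_BRANCH3 245 O * foo/qux/bat.c,v\n"
    b"MY_TAG1 234 O * foo/bar/baz.txt,v\n"
    b"MY_TAG1 241 C MY_BRANCH1 foo/bar/baz.txt,v\n"
)
offsets = build_offsets(sorted_file)
tag_lines = read_symbol(sorted_file, offsets, "MY_TAG1")
```

Because Pass 6 groups each name's lines together, reading can stop at the
first line that belongs to a different name.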
 298
 299 Pass 8:
 300 =======
 301
  302 The 8th pass has very little "thinking" to do--it basically opens
  303 cvs2svn-svn-revnums-to-cvs-revs.db and, starting with Subversion
  304 revision 2 (revision 1 creates /trunk, /tags, and /branches),
  305 sequentially plays out all the commits to either a Subversion
  306 repository or to a dumpfile.
 307
 308 In --dump-only mode, the result of this pass is a Subversion
 309 repository dumpfile (suitable for input to 'svnadmin load').  The
 310 dumpfile is the data's last static stage: last chance to check over
 311 the data, run it through svndumpfilter, move the dumpfile to another
 312 machine, etc.
 313
 314 However, when not in --dump-only mode, no full dumpfile is created for
 315 subsequent load into a Subversion repository.  Instead, miniature
 316 dumpfiles representing a single revision are created, loaded into the
 317 repository, and then removed.
 318
 319 In both modes, the dumpfile revisions are created by walking through
 320 cvs2svn-data.s-revs.
 321
 322                   ===============================
 323                       Branches and Tags Plan.
 324                   ===============================
 325
 326 This pass is also where tag and branch creation is done.  Since
 327 subversion does tags and branches by copying from existing revisions
 328 (then maybe editing the copy, making subcopies underneath, etc), the
 329 big question for cvs2svn is how to achieve the minimum number of
 330 operations per creation.  For example, if it's possible to get the
 331 right tag by just copying revision 53, then it's better to do that
 332 than, say, copying revision 51 and then sub-copying in bits of
 333 revision 52 and 53.
 334
 335 Also, since CVS does not version symbolic names, there is the
 336 secondary question of *when* to create a particular tag or branch.
 337 For example, a tag might have been made at any time after the youngest
 338 commit included in it, or might even have been made piecemeal; and the
 339 same is true for a branch, with the added constraint that for any
 340 particular file, the branch must have been created before the first
 341 commit on the branch.
 342
 343 Answering the second question first: cvs2svn creates tags as soon as
 344 possible and branches as late as possible.
 345
 346 Tags are created as soon as cvs2svn encounters the last CVS Revision
 347 that is a source for that tag.  The whole tag is created in one
 348 Subversion commit.
 349
  350 For branches, this is "just in time" creation -- the moment cvs2svn
  351 sees the first commit on a branch, it snaps the entire branch into
 352 existence (or as much of it as possible), and then outputs the branch
 353 commit.
 354
 355 The reason we say "as much of it as possible" is that it's possible to
  356 have a branch where some files have branch commits occurring earlier
 357 than the other files even have the source revisions from which the
 358 branch sprouts (this can happen if the branch was created piecemeal,
 359 for example).  In this case, we create as much of the branch as we
 360 can, that is, as much of it as there are source revisions available to
 361 copy, and leave the rest for later.  "Later" might mean just until
 362 other branch commits come in, or else during a cleanup stage that
 363 happens at the end of this pass (about which more later).
 364
 365 How just-in-time branch creation works:
 366
 367 In order to make the "best" set of copies/deletes when creating a
 368 branch, cvs2svn keeps track of two sets of trees while it's making
 369 commits:
 370
  371    1. A skeleton mirror of the Subversion repository, that is, an
 372       array of revisions, with a tree hanging off each revision.  (The
 373       "array" is actually implemented as an anydbm database itself,
 374       mapping string representations of numbers to root keys.)
 375
 376    2. A tree for each CVS symbolic name, and the svn file/directory
 377       revisions from which various parts of that tree could be copied.
 378
 379 Both tree sets live in anydbm databases, using the same basic schema:
 380 unique keys map to marshal.dumps() representations of dictionaries,
 381 which in turn map entry names to other unique keys:
 382
 383    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 384    entrykey1 ==> { entrynameX : entrykeyX, ... }
 385    entrykey2 ==> { entrynameY : entrykeyY, ... }
 386    entrykeyX ==> { etc, etc ...}
 387    entrykeyY ==> { etc, etc ...}
 388
 389 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
 390
 391 The repository mirror allows cvs2svn to remember what paths exist in
 392 what revisions.
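
The schema above can be sketched with a plain dict standing in for the
anydbm database.  The TreeStore class, its method names and the integer
key scheme are assumptions for illustration.  For brevity this sketch
changes nodes in place, whereas a real mirror would have to keep the
trees of earlier revisions intact.

```python
import marshal


class TreeStore:
    """Unique keys map to marshal.dumps() of {entry name: entry key}."""

    def __init__(self):
        self.db = {}       # stand-in for an anydbm database
        self.next_key = 0  # assumed key scheme: increasing integers

    def new_node(self):
        key = str(self.next_key)
        self.next_key += 1
        self.db[key] = marshal.dumps({})  # files are empty dictionaries
        return key

    def entries(self, key):
        return marshal.loads(self.db[key])

    def add_path(self, root_key, path):
        """Create any missing nodes along 'path' under root_key."""
        key = root_key
        for name in path.split("/"):
            entries = self.entries(key)
            if name not in entries:
                entries[name] = self.new_node()
                self.db[key] = marshal.dumps(entries)
            key = entries[name]

    def exists(self, root_key, path):
        key = root_key
        for name in path.split("/"):
            entries = self.entries(key)
            if name not in entries:
                return False
            key = entries[name]
        return True


store = TreeStore()
root = store.new_node()
revisions = {"1": root}  # the "array" of revisions: number -> root key
store.add_path(root, "trunk/foo/bar.txt")
```

Looking up a path is then a walk from the root key of a revision through
one dictionary per path component.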
 393
 394 For details on how branches and tags are created, please see the
  395 docstring of the SymbolingsLogger class (and its methods).
 396
 397 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 398 - -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
 399 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 400
 401 Some older notes and ideas about cvs2svn.  Not deleted, because they
 402 may contain suggestions for future improvements in design.
 403
 404 -----------------------------------------------------------------------
 405
 406 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 407 considerations for the tool.
 408
 409 ------
 410 From: John Gardiner Myers <jgmyers@speakeasy.net>
 411 Subject: Thoughts on CVS to SVN conversion
 412 To: gstein@lyra.org
 413 Date: Sun, 15 Apr 2001 17:47:10 -0700
 414
 415 Some things you may want to consider for a CVS to SVN conversion utility:
 416
 417 If converting a CVS repository to SVN takes days, it would be good for
 418 the conversion utility to keep its progress state on disk.  If the
 419 conversion fails halfway through due to a network outage or power
 420 failure, that would allow the conversion to be resumed where it left off
 421 instead of having to start over from an empty SVN repository.
 422
 423 It is a short step from there to allowing periodic updates of a
 424 read-only SVN repository from a read/write CVS repository.  This allows
 425 the more relaxed conversion procedure:
 426
 427 1) Create SVN repository writable only by the conversion tool.
 428 2) Update SVN repository from CVS repository.
 429 3) Announce the time of CVS to SVN cutover.
 430 4) Repeat step (2) as needed.
 431 5) Disable commits to CVS repository, making it read-only.
 432 6) Repeat step (2).
 433 7) Enable commits to SVN repository.
 434 8) Wait for developers to move their workspaces to SVN.
  435 9) Decommission the CVS repository.
 436
  437 You may forward this message or parts of it as you see fit.
 438 ------
 439
 440 -----------------------------------------------------------------------
 441
 442 Further design thoughts from Greg Stein <gstein@lyra.org>
 443
 444 * timestamp the beginning of the process. ignore any commits that
 445   occur after that timestamp; otherwise, you could miss portions of a
 446   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 447   revision for items in B; we missed A)
 448
 449 * the above timestamp can also be used for John's "grab any updates
 450   that were missed in the previous pass."
 451
 452 * for each file processed, watch out for simultaneous commits. this
 453   may cause a problem during the reading/scanning/parsing of the file,
 454   or the parse succeeds but the results are garbaged. this could be
 455   fixed with a CVS lock, but I'd prefer read-only access.
 456
 457   algorithm: get the mtime before opening the file. if an error occurs
 458   during reading, and the mtime has changed, then restart the file. if
 459   the read is successful, but the mtime changed, then restart the
 460   file.
 461
 462 * use a separate log to track unique branches and non-branched forks
 463   of revision history (Q: is it possible to create, say, 1.4.1.3
 464   without a "real" branch?). this log can then be used to create a
 465   /branches/ directory in the SVN repository.
 466
 467   Note: we want to determine some way to coalesce branches across
 468   files. It can't be based on name, though, since the same branch name
 469   could be used in multiple places, yet they are semantically
 470   different branches. Given files R, S, and T with branch B, we can
 471   tie those files' branch B into a "semantic group" whenever we see
  472   commit groups on a branch touching multiple files. Files that
  473   have a (named) branch but no commits on it are simply ignored. For
 474   each "semantic group" of a branch, we'd create a branch based on
 475   their common ancestor, then make the changes on the children as
 476   necessary. For single-file commits to a branch, we could use
 477   heuristics (pathname analysis) to add these to a group (and log what
 478   we did), or we could put them in a "reject" kind of file for a human
 479   to tell us what to do (the human would edit a config file of some
 480   kind to instruct the converter).
 481
 482 * if we have access to the CVSROOT/history, then we could process tags
 483   properly. otherwise, we can only use heuristics or configuration
 484   info to group up tags (branches can use commits; there are no
 485   commits associated with tags)
 486
 487 * ideally, we store every bit of data from the ,v files to enable a
 488   complete restoration of the CVS repository. this could be done by
 489   storing properties with CVS revision numbers and stuff (i.e. all
 490   metadata not already embodied by SVN would go into properties)
 491
 492 * how do we track the "states"? I presume "dead" is simply deleting
 493   the entry from SVN. what are the other legal states, and do we need
 494   to do anything with them?
 495
 496 * where do we put the "description"? how about locks, access list,
 497   keyword flags, etc.
 498
 499 * note that using something like the SourceForge repository will be an
 500   ideal test case. people *move* their repositories there, which means
 501   that all kinds of stuff can be found in those repositories, from
 502   wherever people used to run them, and under whatever development
 503   policies may have been used.
 504
 505   For example: I found one of the projects with a "permissions 644;"
 506   line in the "gnuplot" repository.  Most RCS releases issue warnings
 507   about that (although they properly handle/skip the lines), and CVS
 508   ignores RCS newphrases altogether.
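
The mtime check proposed in the notes above for simultaneous commits
could be sketched as follows.  The function name, the 'parse' callback
and the retry limit are assumptions; the notes only describe the
restart rule itself.

```python
import os
import tempfile


def read_stable(path, parse=lambda data: data, max_attempts=5):
    """Read and parse 'path', restarting if its mtime changes meanwhile."""
    for _ in range(max_attempts):
        before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                result = parse(f.read())
        except Exception:
            if os.stat(path).st_mtime_ns != before:
                continue  # the file changed under us: restart it
            raise         # the file is unchanged, so the error is real
        if os.stat(path).st_mtime_ns == before:
            return result
        # The read succeeded but the file changed: restart as well.
    raise RuntimeError("%s kept changing during %d attempts"
                       % (path, max_attempts))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo,v")
    with open(demo_path, "wb") as f:
        f.write(b"head 1.1;\n")
    demo = read_stable(demo_path)
```

This keeps access read-only, as the notes prefer, at the cost of
occasionally re-reading a file that was being committed to.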