design-notes.txt

   1                          How cvs2svn Works
   2                          =================
   3
   4 A cvs2svn run consists of eight passes.  Each pass saves the data it
   5 produces to files on disk, so that a) we don't hold huge amounts of
   6 state in memory, and b) the conversion process is resumable.
   7
   8 CollectRevsPass (formerly called pass1)
   9 ===============
  10
  11 The goal of this pass is to write a summary of each CVS file as a
  12 pickled CVSFile to 'cvs2svn-cvs-files.db', and a summary of each CVS
  13 file revision as a pickled CVSRevision to 'cvs2svn-cvs-revs.db'.  In
  14 each case, items are assigned an arbitrary key that is used to refer
  15 to them.
  16
  17 We walk over the repository, collecting data about the RCS files into
  18 an instance of CollectData.  Each RCS file is processed with
  19 rcsparse.parse(), which invokes callbacks from an instance of
  20 cvs2svn's _FileDataCollector class (which is a subclass of
  21 rcsparse.Sink).
  22
  23 For each RCS file, the first thing the parser encounters is the
  24 administrative header, including the head revision, the principal
  25 branch, symbolic names, RCS comments, etc.  The main thing that
  26 happens here is that _FileDataCollector.define_tag() is invoked on
  27 each symbolic name and its attached revision, so all the tags and
  28 branches of this file get collected.  When this stage is done, the
  29 parser invokes admin_completed(), which writes the CVSFile to the
  30 database.
  31
  32 Next, the parser hits the revision summary section.  That's the part
  33 of the RCS file that looks like this:
  34
  35    1.6
  36    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
  37    branches
  38         1.6.2.1;
  39    next 1.5;
  40
  41    1.5
  42    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
  43    branches;
  44    next 1.4;
  45
  46    [...]
  47
  48 For each revision summary, _FileDataCollector.define_revision() is
  49 invoked, recording that revision's metadata in various variables of
  50 the _FileDataCollector class instance.
  51
  52 After finishing the revision summaries, the parser invokes
  53 _FileDataCollector.tree_completed(), which loops over the revision
  54 information stored, determining if there are instances where a higher
  55 revision was committed "before" a lower one (rare, but it can happen
  56 when there was clock skew on the repository machine).  If there are
  57 any, it "resyncs" the timestamp of the earlier rev to be just before
  58 that of the later rev, but saves the original timestamp in
  59 self._rev_data[blah].original_timestamp, so we can later write out a
  60 record to the resync file indicating that an adjustment was made (this
  61 makes it possible to catch the other parts of this commit and resync
  62 them similarly; more details below).
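
The resync step can be sketched roughly as follows.  This is only an
illustration, not the actual tree_completed() code: the RevData class
and the resync_chain() helper are hypothetical, and we assume the
revisions of one line of development arrive as a simple
oldest-to-newest chain.

```python
class RevData:
    """Hypothetical per-revision record: name plus commit timestamp."""

    def __init__(self, rev, timestamp):
        self.rev = rev
        self.timestamp = timestamp
        # Set only if the timestamp had to be adjusted.
        self.original_timestamp = None


def resync_chain(chain):
    """Make timestamps strictly increase along CHAIN (oldest first).

    Walk from the newest revision back to the oldest.  Whenever an
    earlier revision is not older than its successor, move it to one
    second before the successor and remember its original timestamp.
    Return the list of adjusted revisions.
    """
    adjusted = []
    for i in range(len(chain) - 1, 0, -1):
        earlier, later = chain[i - 1], chain[i]
        if earlier.timestamp >= later.timestamp:
            earlier.original_timestamp = earlier.timestamp
            earlier.timestamp = later.timestamp - 1
            adjusted.append(earlier)
    return adjusted


# Revision 1.2 claims to be newer than 1.3, so it gets pulled back.
chain = [RevData('1.1', 100), RevData('1.2', 250), RevData('1.3', 200)]
adjusted = resync_chain(chain)
```

Walking from newest to oldest means an adjustment can cascade: if
pulling one revision back makes it collide with its own predecessor,
the next loop iteration catches that too.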
  63
  64 Next, the parser encounters the *real* revision data, which has the
  65 log messages and file contents.  For each revision, it invokes
  66 _FileDataCollector.set_revision_info(), which writes a record to
  67 'cvs2svn-cvs-revs.db'.
  68
  69 Also, for resync'd revisions, a line like this is written out to
  70 'cvs2svn-resync.txt':
  71
  72    3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
  73
  74 The fields are:
  75
  76    NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP
  77
  78 (The resync file will be explained later.)
  79
  80 That's it -- the RCS file is done.
  81
  82 When every CVS file is done, CollectRevsPass is complete, and:
  83
  84    - 'cvs2svn-cvs-files.db' contains a record of every CVS file.
  85
  86    - 'cvs2svn-cvs-revs.db' contains a summary of every revision to
  87      every CVS file, including a reference to the corresponding CVS
  88      file record in 'cvs2svn-cvs-files.db'.  The order of the
  89      revisions is arbitrary.  In other words, a multi-file commit will
  90      be scattered all over the place.
  91
  92    - 'cvs2svn-a-revs.txt' contains a list of CVSRevision keys that are
  93      in 'cvs2svn-cvs-revs.db', in the order that they were written.
  94
  95    - 'cvs2svn-resync.txt' contains a small amount of resync data, in
  96      no particular order.
  97
  98    - 'cvs2svn-branches.txt' ???
  99    - 'cvs2svn-tags.txt' ???
 100    - 'cvs2svn-metadata.db' ???
 101
 102
 103 ResyncRevsPass (formerly called pass2)
 104 ==============
 105
 106 This is where the resync file is used.  The goal of this pass is to
 107 output the information from cvs2svn-cvs-revs.db to a new file,
 108 'cvs2svn-cvs-revs-resync.db' (clean revs).  It has the same content as
 109 the original file, except for some resync'd timestamps.
 110
 111 First, read the whole resync file into a hash table that maps each
 112 author+log digest to a list of lists.  Each sublist represents one of
 113 the timestamp adjustments from CollectRevsPass, and looks like this:
 114
 115    [old_time_lower, old_time_upper, new_time]
 116
 117 The reason to map each digest to a list of sublists, instead of to one
 118 list, is that sometimes you'll get the same digest for unrelated
 119 commits (for example, the same author commits many times using the
 120 empty log message, or a log message that just says "Doc tweaks.").  So
 121 each digest may need to "fan out" to cover multiple commits, but
 122 without accidentally unifying those commits.
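
A minimal sketch of loading the resync file into such a table, using
the NEW_TIMESTAMP DIGEST OLD_TIMESTAMP line format shown earlier.  The
load_resync() name, the COMMIT_THRESHOLD value of 300 seconds, and the
choice to centre the initial window on the old timestamp are
assumptions made for illustration.  The second sample line is made-up
data.

```python
COMMIT_THRESHOLD = 300  # assumed value, in seconds


def load_resync(lines):
    """Map each digest to a list of [old_lower, old_upper, new_time]."""
    resync = {}
    for line in lines:
        new_hex, digest, old_hex = line.split()
        old_time = int(old_hex, 16)
        new_time = int(new_hex, 16)
        # Assumption: the initial window is centred on the old time.
        window = [old_time - COMMIT_THRESHOLD // 2,
                  old_time + COMMIT_THRESHOLD // 2,
                  new_time]
        # Same digest, separate adjustment: append, never merge.
        resync.setdefault(digest, []).append(window)
    return resync


sample = [
    '3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328\n',
    '3d6c2000 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1fff\n',
]
table = load_resync(sample)
```

Both sample lines share a digest, so the table has one key whose value
holds two independent sublists.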
 123
 124 Now we loop over 'cvs2svn-cvs-revs.db', and for each record write a
 125 line to 'cvs2svn-c-revs.txt'.  Each line of this file looks like
 126 this:
 127
 128    3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 12ab
 129
 130 The fields are:
 131
 132    1.  a fixed-width timestamp
 133    2.  a digest of the log message + author
 134    3.  the integer unique ID for this CVSRevision, as a hexadecimal
 135        string.
 136
 137 Any CVSRevision record in 'cvs2svn-cvs-revs.db' whose digest matches
 138 some resync entry, and which appears to be part of the same commit as
 139 one of the sublists in that entry, gets tweaked.  The tweak is to set
 140 the commit time of the line to new_time, which is taken from the
 141 resync hash and results from the adjustment described in
 142 CollectRevsPass.
 143
 144 The way we figure out whether a given line needs to be tweaked is to
 145 loop over all the sublists, seeing if this commit's original time
 146 falls within the old<-->new time range for the current sublist.  If it
 147 does, we tweak the line before writing it out, and then conditionally
 148 adjust the sublist's range to account for the timestamp we just
 149 adjusted (since it could be an outlier).  Note that this could, in
 150 theory, result in separate commits being accidentally unified, since
 151 we might gradually adjust the two sides of the range such that they are
 152 eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
 153 really a case of CVS not recording enough information to disambiguate
 154 the commits; we'd know we have a time range that exceeds the
 155 COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
 156 up.  We could try some clever heuristic, but for now it's not
 157 important -- after all, we're talking about commits that weren't
 158 important enough to have a distinctive log message anyway, so does it
 159 really matter if a couple of them accidentally get unified?  Probably
 160 not.
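
The matching loop might look something like this.  It is a sketch
under assumptions: resync_time() is a hypothetical helper, the
COMMIT_THRESHOLD value is assumed, and the exact rule for widening the
window (half a threshold on either side of the matched timestamp) is a
guess at what "conditionally adjust" means here.

```python
COMMIT_THRESHOLD = 300  # assumed value, in seconds


def resync_time(timestamp, sublists):
    """Return the adjusted time for TIMESTAMP, or TIMESTAMP unchanged.

    Each sublist is [old_time_lower, old_time_upper, new_time] and may
    be widened in place when it matches.
    """
    half = COMMIT_THRESHOLD // 2
    for record in sublists:
        if record[0] <= timestamp <= record[1]:
            # The match may be an outlier: stretch the window so that
            # neighbours of this timestamp are caught as well.
            record[0] = min(record[0], timestamp - half)
            record[1] = max(record[1], timestamp + half)
            return record[2]
    return timestamp


sublists = [[850, 1150, 1001]]
results = [resync_time(t, sublists) for t in (1100, 1240, 2000)]
```

In this run the timestamp 1240 only matches because 1100 widened the
window first, which is exactly the gradual drift that could, in
theory, unify separate commits.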
 161
 162 'cvs2svn-tags.db' ??? is also created during this pass, for
 163 undocumented purposes.
 164
 165
 166 SortRevsPass (formerly called pass3)
 167 ============
 168
 169 This is where we deduce the changesets, that is, the grouping of file
 170 changes into single commits.
 171
 172 It's very simple -- run 'sort' on 'cvs2svn-c-revs.txt', converting it
 173 to 'cvs2svn-s-revs.txt'.  Because of the way the data is laid out,
 174 this causes commits with the same digest (that is, the same author and
 175 log message) to be grouped together.  Poof!  We now have the CVS
 176 changes grouped by logical commit.
 177
 178 In some cases, the changes in a given commit may be interleaved with
 179 other commits that went on at the same time, because the sort gives
 180 precedence to date before log digest.  However, CreateDatabasesPass
 181 detects this by seeing that the log digest is different, and
 182 reseparates the commits.
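
The effect of the sort can be seen in a small sketch.  The
format_c_rev() helper is hypothetical and the digests are
placeholders.  Python's list sort compares strings by code point,
which for these ASCII lines should behave like 'sort' in the C locale.

```python
def format_c_rev(timestamp, digest, rev_id):
    """Format one line: fixed-width hex time, digest, hex revision id."""
    return '%08x %s %x\n' % (timestamp, digest, rev_id)


lines = [
    format_c_rev(0x3dc32955, 'b' * 40, 0x12ab),
    format_c_rev(0x3dc32950, 'a' * 40, 0x7),
    format_c_rev(0x3dc32955, 'a' * 40, 0x12ac),
    format_c_rev(0x3dc32950, 'a' * 40, 0x12),
]
# Because the timestamp is fixed-width hex, a plain textual sort is
# also a chronological sort; ties are then broken by digest.
lines.sort()
```

After sorting, the two 'a' revisions at 3dc32950 are adjacent, while
the 'a' and 'b' revisions that share the timestamp 3dc32955 sit next
to each other and have to be told apart by their digests.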
 183
 184
 185 CreateDatabasesPass (formerly called pass4)
 186 ===================
 187
 188 Find and create a database containing the last CVS revision that is a
 189 source (also referred to as an "opening" revision) for all symbolic
 190 names.  This will result in a database containing key-value pairs
 191 whose key is the id for a CVSRevision, and whose value is a list of
 192 symbolic names for which that CVSRevision is the last "opening."
 193
 194 The format for this file is:
 195
 196     'cvs2svn-symbol-last-cvs-revs.db':
 197          Key                      Value
 198          CVS Revision ID          array of Symbolic names
 199
 200     For example:
 201
 202          5c                      --> ['TAG11', 'BRANCH38']
 203          62                      --> ['TAG39']
 204          4d                      --> ['BRANCH48', 'BRANCH37']
 205          f                       --> ['TAG320', 'TAG1178']
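
One way to build such a mapping is sketched below.  The
last_openings() helper and its input format (pairs of revision id and
the symbols that revision is a source for, in processing order) are
assumptions for illustration.  A plain dictionary stands in for the
database.

```python
def last_openings(revisions):
    """Map revision id -> names of symbols last opened by it.

    REVISIONS is an iterable of (rev_id, symbol_names) pairs in
    processing order; a later pair overrides an earlier one.
    """
    # First remember, per symbol, the most recent source revision.
    last_rev = {}
    for rev_id, symbols in revisions:
        for name in symbols:
            last_rev[name] = rev_id
    # Then invert: revision id -> list of symbols.
    result = {}
    for name, rev_id in last_rev.items():
        result.setdefault(rev_id, []).append(name)
    return result


revisions = [
    ('f', ['TAG320']),
    ('4d', ['BRANCH48', 'TAG39']),
    ('62', ['TAG39']),
    ('f2', []),
]
db = last_openings(revisions)
```

Here TAG39 has two sources, but only the later one ('62') is recorded
as its last opening.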
 206
 207
 208 AggregateRevsPass (formerly called pass5)
 209 =================
 210
 211 Primarily, this pass gathers CVS revisions into Subversion revisions
 212 (a Subversion revision consists of one or more CVS revisions)
 213 before we actually begin committing (where "committing" means either
 214 to a Subversion repository or to a dump file).
 215
 216 This pass does the following:
 217
 218 1. Creates a database file to map Subversion Revision numbers to their
 219    corresponding CVS Revisions ('cvs2svn-svn-revnums-to-cvs-revs.db').
 220    Creates another database file to map CVS Revisions to their
 221    Subversion Revision numbers ('cvs2svn-cvs-revs-to-svn-revnums.db').
 222
 223 2. When a file is copied to a symbolic name in cvs2svn, there is a
 224    range of valid Subversion revisions that we can copy the file from.
 225    The first valid Subversion revision number for a symbolic name is
 226    called the "Opening", and the first *invalid* Subversion revision
 227    number encountered after the "Opening" is called the "Closing".  In
 228    this pass, the SymbolingsLogger class writes one line to
 229    'cvs2svn-symbolic-names.txt' per CVS file, per symbolic name, per
 230    opening or closing.
 231
 232 3. For each CVS Revision in s-revs, we write out a line (for each
 233    symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
 234    the first possible source revision (the "opening" revision) for a
 235    copy to create a branch or tag, or if it is the last possible
 236    revision (the "closing" revision) for a copy to create a branch or
 237    tag.  Not every opening will have a corresponding closing.
 238
 239    The format of each line is:
 240
 241        SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_FILE_ID
 242
 243    For example:
 244
 245        MY_TAG1 234 O * 1a7
 246        MY_BRANCH3 245 O * 1a9
 247        MY_TAG1 241 C MY_BRANCH1 1a7
 248        MY_BRANCH_BLAH 201 O MY_BRANCH1 1b3
 249
 250    Here is what the columns mean:
 251
 252    SYMBOLIC_NAME: The name of the branch or tag that starts or ends
 253                   in this CVS Revision (there can be multiples per
 254                   CVS rev).
 255
 256    SVN_REVNUM: The Subversion revision number that is the opening or
 257                closing for this SYMBOLIC_NAME.
 258
 259    TYPE: "O" for Openings and "C" for Closings.
 260
 261    BRANCH_NAME: The (uncleaned) branch name where this opening or
 262                 closing happened.  '*' denotes the default branch.
 263
 264    CVS_FILE_ID: The ID of the CVS file where this opening or closing
 265                 happened, in hexadecimal.
 266
 267    See SymbolingsLogger for more details.
 268
 269 4. Creates a file 'cvs2svn-symbolic-names-closings-tmp.txt' ??? for
 270    undocumented purposes.
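
For reference, a line of 'cvs2svn-symbolic-names.txt' in the format
described in item 3 could be parsed along these lines.  The Symboling
tuple and parse_symboling() are hypothetical names.  We assume
SVN_REVNUM is written in decimal, as the examples suggest, and we map
the '*' default-branch marker to None purely for convenience.

```python
from collections import namedtuple

Symboling = namedtuple(
    'Symboling', 'name svn_revnum type branch_name cvs_file_id')


def parse_symboling(line):
    """Split one line into typed fields."""
    name, revnum, type_, branch, file_id = line.split()
    if type_ not in ('O', 'C'):
        raise ValueError('unknown TYPE %r' % type_)
    # '*' denotes the default branch; represent it as None here.
    if branch == '*':
        branch = None
    # CVS_FILE_ID is hexadecimal; SVN_REVNUM is assumed to be decimal.
    return Symboling(name, int(revnum), type_, branch, int(file_id, 16))


entry = parse_symboling('MY_TAG1 241 C MY_BRANCH1 1a7\n')
```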
 271
 272
 273 SortSymbolsPass (formerly called pass6)
 274 ===============
 275
 276 This pass merely sorts 'cvs2svn-symbolic-names.txt' into
 277 'cvs2svn-symbolic-names-s.txt'.  This orders the file first by
 278 symbolic name, and second by Subversion revision number, thus grouping
 279 all openings and closings for each symbolic name together.
 280
 281
 282 IndexSymbolsPass (formerly called pass7)
 283 ================
 284
 285 This pass iterates through all the lines in
 286 'cvs2svn-symbolic-names-s.txt', writing out a database file
 287 ('cvs2svn-symbolic-name-offsets.db') mapping SYMBOLIC_NAME to the file
 288 offset in 'cvs2svn-symbolic-names-s.txt' where SYMBOLIC_NAME is first
 289 encountered.  This will allow us to seek to the various offsets in the
 290 file and sequentially read only the openings and closings that we
 291 need.
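
The indexing idea can be sketched as follows.  An in-memory file and a
plain dictionary stand in for 'cvs2svn-symbolic-names-s.txt' and the
offsets database, and the build_offsets() and read_symbol() helpers
are hypothetical.

```python
import io


def build_offsets(f):
    """Map each symbolic name to the offset of its first line in F."""
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        name = line.split(b' ', 1)[0]
        # Keep only the first offset seen for each name.
        offsets.setdefault(name, offset)
    return offsets


def read_symbol(f, offsets, name):
    """Return all lines for NAME, reading sequentially from its offset."""
    f.seek(offsets[name])
    lines = []
    for line in iter(f.readline, b''):
        if line.split(b' ', 1)[0] != name:
            break  # The file is sorted, so the group has ended.
        lines.append(line)
    return lines


sorted_file = io.BytesIO(
    b'MY_BRANCH3 245 O * 1a9\n'
    b'MY_TAG1 234 O * 1a7\n'
    b'MY_TAG1 241 C MY_BRANCH1 1a7\n'
)
offsets = build_offsets(sorted_file)
tag_lines = read_symbol(sorted_file, offsets, b'MY_TAG1')
```

Reading stops at the first line with a different name, which only
works because SortSymbolsPass has already grouped the lines.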
 292
 293
 294 OutputPass (formerly called pass8)
 295 ==========
 296
 297 This pass has very little "thinking" to do--it basically opens
 298 'cvs2svn-svn-revnums-to-cvs-revs.db' and, starting with Subversion
 299 revision 2 (revision 1 creates /trunk, /tags, and /branches),
 300 sequentially plays out all the commits to either a Subversion
 301 repository or to a dumpfile.
 302
 303 In --dump-only mode, the result of this pass is a Subversion
 304 repository dumpfile (suitable for input to 'svnadmin load').  The
 305 dumpfile is the data's last static stage: last chance to check over
 306 the data, run it through svndumpfilter, move the dumpfile to another
 307 machine, etc.
 308
 309 However, when not in --dump-only mode, no full dumpfile is created for
 310 subsequent load into a Subversion repository.  Instead, miniature
 311 dumpfiles representing a single revision are created, loaded into the
 312 repository, and then removed.
 313
 314 In both modes, the dumpfile revisions are created by walking through
 315 'cvs2svn-s-revs.txt'.
 316
 317 The databases 'cvs2svn-svn-nodes.db' and 'cvs2svn-svn-revisions.db'
 318 form a skeletal (metadata only, no content) mirror of the repository
 319 structure that cvs2svn is creating.  They provide data about previous
 320 revisions that cvs2svn requires while constructing the dumpstream.
 321
 322
 323                   ===============================
 324                       Branches and Tags Plan.
 325                   ===============================
 326
 327 OutputPass is also where tag and branch creation is done.  Since
 328 Subversion does tags and branches by copying from existing revisions
 329 (then maybe editing the copy, making subcopies underneath, etc), the
 330 big question for cvs2svn is how to achieve the minimum number of
 331 operations per creation.  For example, if it's possible to get the
 332 right tag by just copying revision 53, then it's better to do that
 333 than, say, copying revision 51 and then sub-copying in bits of
 334 revision 52 and 53.
 335
 336 Also, since CVS does not version symbolic names, there is the
 337 secondary question of *when* to create a particular tag or branch.
 338 For example, a tag might have been made at any time after the youngest
 339 commit included in it, or might even have been made piecemeal; and the
 340 same is true for a branch, with the added constraint that for any
 341 particular file, the branch must have been created before the first
 342 commit on the branch.
 343
 344 Answering the second question first: cvs2svn creates tags as soon as
 345 possible and branches as late as possible.
 346
 347 Tags are created as soon as cvs2svn encounters the last CVS Revision
 348 that is a source for that tag.  The whole tag is created in one
 349 Subversion commit.
 350
 351 For branches, this is "just in time" creation -- the moment it sees
 352 the first commit on a branch, it snaps the entire branch into
 353 existence (or as much of it as possible), and then outputs the branch
 354 commit.
 355
 356 The reason we say "as much of it as possible" is that it's possible to
 357 have a branch where some files have branch commits occurring earlier
 358 than the other files even have the source revisions from which the
 359 branch sprouts (this can happen if the branch was created piecemeal,
 360 for example).  In this case, we create as much of the branch as we
 361 can, that is, as much of it as there are source revisions available to
 362 copy, and leave the rest for later.  "Later" might mean just until
 363 other branch commits come in, or else during a cleanup stage that
 364 happens at the end of this pass (about which more later).
 365
 366 How just-in-time branch creation works:
 367
 368 In order to make the "best" set of copies/deletes when creating a
 369 branch, cvs2svn keeps track of two sets of trees while it's making
 370 commits:
 371
 372    1. A skeleton mirror of the Subversion repository, that is, an
 373       array of revisions, with a tree hanging off each revision.  (The
 374       "array" is actually implemented as an anydbm database itself,
 375       mapping string representations of numbers to root keys.)
 376
 377    2. A tree for each CVS symbolic name, and the svn file/directory
 378       revisions from which various parts of that tree could be copied.
 379
 380 Both tree sets live in anydbm databases, using the same basic schema:
 381 unique keys map to marshal.dumps() representations of dictionaries,
 382 which in turn map entry names to other unique keys:
 383
 384    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 385    entrykey1 ==> { entrynameX : entrykeyX, ... }
 386    entrykey2 ==> { entrynameY : entrykeyY, ... }
 387    entrykeyX ==> { etc, etc ...}
 388    entrykeyY ==> { etc, etc ...}
 389
 390 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
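
A toy version of this schema, for illustration only.  A plain
dictionary stands in for the anydbm database, the key generator and
the add_path() and path_exists() helpers are hypothetical, and this
sketch edits one tree in place rather than keeping a separate root per
revision.

```python
import marshal

db = {}           # stands in for an anydbm database
_next_key = [0]


def new_key():
    """Return a fresh unique key as a string."""
    _next_key[0] += 1
    return '%x' % _next_key[0]


def add_path(db, root_key, path):
    """Create PATH under ROOT_KEY, adding any missing entries."""
    key = root_key
    for name in path.split('/'):
        entries = marshal.loads(db[key])
        if name not in entries:
            entries[name] = new_key()
            # Files and directories alike are stored as dictionaries.
            db[entries[name]] = marshal.dumps({})
            db[key] = marshal.dumps(entries)
        key = entries[name]
    return key


def path_exists(db, root_key, path):
    """Follow PATH one entry at a time; report whether it resolves."""
    key = root_key
    for name in path.split('/'):
        entries = marshal.loads(db[key])
        if name not in entries:
            return False
        key = entries[name]
    return True


root = new_key()
db[root] = marshal.dumps({})
add_path(db, root, 'trunk/proj/file.c')
```

Every lookup costs one database read per path component, which keeps
each stored value small no matter how large the tree grows.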
 391
 392 The repository mirror allows cvs2svn to remember what paths exist in
 393 what revisions.
 394
 395 For details on how branches and tags are created, please see the
 396 docstring of the SymbolingsLogger class (and its methods).
 397
 398 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 399 - -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
 400 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 401
 402 Some older notes and ideas about cvs2svn.  Not deleted, because they
 403 may contain suggestions for future improvements in design.
 404
 405 -----------------------------------------------------------------------
 406
 407 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 408 considerations for the tool.
 409
 410 ------
 411 From: John Gardiner Myers <jgmyers@speakeasy.net>
 412 Subject: Thoughts on CVS to SVN conversion
 413 To: gstein@lyra.org
 414 Date: Sun, 15 Apr 2001 17:47:10 -0700
 415
 416 Some things you may want to consider for a CVS to SVN conversion utility:
 417
 418 If converting a CVS repository to SVN takes days, it would be good for
 419 the conversion utility to keep its progress state on disk.  If the
 420 conversion fails halfway through due to a network outage or power
 421 failure, that would allow the conversion to be resumed where it left off
 422 instead of having to start over from an empty SVN repository.
 423
 424 It is a short step from there to allowing periodic updates of a
 425 read-only SVN repository from a read/write CVS repository.  This allows
 426 the more relaxed conversion procedure:
 427
 428 1) Create SVN repository writable only by the conversion tool.
 429 2) Update SVN repository from CVS repository.
 430 3) Announce the time of CVS to SVN cutover.
 431 4) Repeat step (2) as needed.
 432 5) Disable commits to CVS repository, making it read-only.
 433 6) Repeat step (2).
 434 7) Enable commits to SVN repository.
 435 8) Wait for developers to move their workspaces to SVN.
 436 9) Decommission the CVS repository.
 437
 438 You may forward this message or parts of it as you see fit.
 439 ------
 440
 441 -----------------------------------------------------------------------
 442
 443 Further design thoughts from Greg Stein <gstein@lyra.org>
 444
 445 * timestamp the beginning of the process. ignore any commits that
 446   occur after that timestamp; otherwise, you could miss portions of a
 447   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 448   revision for items in B; we missed A)
 449
 450 * the above timestamp can also be used for John's "grab any updates
 451   that were missed in the previous pass."
 452
 453 * for each file processed, watch out for simultaneous commits. this
 454   may cause a problem during the reading/scanning/parsing of the file,
 455   or the parse succeeds but the results are garbage. this could be
 456   fixed with a CVS lock, but I'd prefer read-only access.
 457
 458   algorithm: get the mtime before opening the file. if an error occurs
 459   during reading, and the mtime has changed, then restart the file. if
 460   the read is successful, but the mtime changed, then restart the
 461   file.
 462
 463 * use a separate log to track unique branches and non-branched forks
 464   of revision history (Q: is it possible to create, say, 1.4.1.3
 465   without a "real" branch?). this log can then be used to create a
 466   /branches/ directory in the SVN repository.
 467
 468   Note: we want to determine some way to coalesce branches across
 469   files. It can't be based on name, though, since the same branch name
 470   could be used in multiple places, yet they are semantically
 471   different branches. Given files R, S, and T with branch B, we can
 472   tie those files' branch B into a "semantic group" whenever we see
 473   commit groups on a branch touching multiple files. Files that
 474   have a (named) branch but no commits on it are simply ignored. For
 475   each "semantic group" of a branch, we'd create a branch based on
 476   their common ancestor, then make the changes on the children as
 477   necessary. For single-file commits to a branch, we could use
 478   heuristics (pathname analysis) to add these to a group (and log what
 479   we did), or we could put them in a "reject" kind of file for a human
 480   to tell us what to do (the human would edit a config file of some
 481   kind to instruct the converter).
 482
 483 * if we have access to the CVSROOT/history, then we could process tags
 484   properly. otherwise, we can only use heuristics or configuration
 485   info to group up tags (branches can use commits; there are no
 486   commits associated with tags)
 487
 488 * ideally, we store every bit of data from the ,v files to enable a
 489   complete restoration of the CVS repository. this could be done by
 490   storing properties with CVS revision numbers and stuff (i.e. all
 491   metadata not already embodied by SVN would go into properties)
 492
 493 * how do we track the "states"? I presume "dead" is simply deleting
 494   the entry from SVN. what are the other legal states, and do we need
 495   to do anything with them?
 496
 497 * where do we put the "description"? how about locks, access list,
 498   keyword flags, etc.
 499
 500 * note that using something like the SourceForge repository will be an
 501   ideal test case. people *move* their repositories there, which means
 502   that all kinds of stuff can be found in those repositories, from
 503   wherever people used to run them, and under whatever development
 504   policies may have been used.
 505
 506   For example: I found one of the projects with a "permissions 644;"
 507   line in the "gnuplot" repository.  Most RCS releases issue warnings
 508   about that (although they properly handle/skip the lines), and CVS
 509   ignores RCS newphrases altogether.
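
The mtime-based restart idea from the notes above could be sketched
like this.  The read_stable() helper and its max_attempts limit are
hypothetical, and a plain read stands in for parsing the RCS file.

```python
import os
import tempfile


def read_stable(path, max_attempts=5):
    """Read PATH, restarting if its mtime changes during the read."""
    for _ in range(max_attempts):
        mtime = os.stat(path).st_mtime
        try:
            with open(path, 'rb') as f:
                data = f.read()
        except OSError:
            if os.stat(path).st_mtime != mtime:
                continue  # Changed during the read: restart the file.
            raise
        if os.stat(path).st_mtime == mtime:
            return data
        # The read succeeded but the file changed: restart the file.
    raise RuntimeError('%s kept changing while being read' % path)


# Demonstration on a throwaway file that nobody else is writing to.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'head 1.1;\n')
data = read_stable(tmp.name)
os.remove(tmp.name)
```

This needs only read access, at the cost of possibly missing a change
that lands within the timestamp resolution of the filesystem.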
 510
 511 # vim:tw=70