doc/design-notes.txt

   1                          How cvs2svn Works
   2                          =================
   3
   4                        Theory and requirements
   5                        ------ --- ------------
   6
   7 There are two main problems in converting a CVS repository to SVN:
   8
   9 - CVS does not record enough information to determine what actually
  10   happened to a repository.  For example, CVS does not record:
  11
  12   - Which file modifications were part of the same commit
  13
  14   - The timestamp of tag and branch creations
  15
  16   - Exactly which revision was the base of a branch (there is
  17     ambiguity between x.y, x.y.2.0, x.y.4.0, etc.)
  18
  19   - When the default branch was changed (for example, from a vendor
  20     branch back to trunk).
  21
  22 - The timestamps in a CVS archive are not reliable.  It can easily
  23   happen that timestamps are not even monotonic, and large errors (for
  24   example due to a failing server clock battery) are not unusual.
  25
  26 The absolutely crucial, sine qua non requirement of a conversion is
  27 that the dependency relationships within a file be honored, mainly:
  28
  29 - A revision depends on its predecessor
  30
  31 - A branch creation depends on the revision from which it branched,
  32   and commits on the branch depend on the branch creation
  33
  34 - A tag creation depends on the revision being tagged
  35
  36 These dependencies are reliably defined in the CVS repository, and
  37 they trump all others, so they are the scaffolding of the conversion.
  38
  39 Moreover, it is highly desirable that the timestamps of the SVN
  40 commits be monotonically increasing.
  41
  42 Within these constraints we also want the results of the conversion to
  43 resemble the history of the CVS repository as closely as possible.
  44 For example, the set of file changes grouped together in an SVN commit
  45 should be the same as the files changed within the corresponding CVS
  46 commit, insofar as that can be achieved in a manner that is consistent
  47 with the dependency requirements.  And the SVN commit timestamps
  48 should recreate the time of the CVS commit as far as possible without
  49 violating the monotonicity requirement.
  50
  51 The basic idea of the conversion is this: create the largest
  52 conceivable changesets, then split up changesets as necessary to break
  53 any cycles in the graph of changeset dependencies.  When all cycles
  54 have been removed, then do a topological sort of the changesets (with
  55 ambiguities resolved using CVS timestamps) to determine a
  56 self-consistent changeset commit order.
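
The ordering step described above can be illustrated with a minimal
Python sketch.  It is not cvs2svn's actual code: the function name
timestamp_toposort and the dict-based inputs are hypothetical, and
every dependency is assumed to be a key of `timestamps`.  Among the
changesets whose dependencies are all satisfied, the one with the
earliest timestamp is emitted first; if the sort stalls, a cycle
remains and has to be broken before ordering can succeed.

```python
import heapq


def timestamp_toposort(timestamps, deps):
    """Order changeset ids so that dependencies always come first.

    timestamps: dict id -> timestamp (hypothetical input format)
    deps: dict id -> set of ids that must be committed earlier
    Raises ValueError if a dependency cycle remains.
    """
    remaining = {cid: set(deps.get(cid, ())) for cid in timestamps}
    dependents = {cid: [] for cid in timestamps}
    for cid, preds in remaining.items():
        for pred in preds:
            dependents[pred].append(cid)

    # Changesets with no unmet dependencies, ordered by timestamp.
    ready = [(timestamps[cid], cid)
             for cid, preds in remaining.items() if not preds]
    heapq.heapify(ready)

    order = []
    while ready:
        _, cid = heapq.heappop(ready)
        order.append(cid)
        for succ in dependents[cid]:
            remaining[succ].discard(cid)
            if not remaining[succ]:
                heapq.heappush(ready, (timestamps[succ], succ))

    if len(order) != len(timestamps):
        raise ValueError("dependency cycle among changesets")
    return order
```

Using a heap keyed on the timestamp is one simple way to resolve
ambiguities; the dependency links always take precedence over the
timestamps.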
  57
  58 The quality of the conversion (not in terms of correctness, but in
  59 terms of minimizing the number of svn commits) is mostly determined by
  60 the cleverness of the heuristics used to split up cycles.  And all of
  61 this has to be affordable, especially in terms of conversion time and
  62 RAM usage, for even the largest CVS repositories.
  63
  64
  65                             Implementation
  66                             --------------
  67
  68 A cvs2svn run consists of a number of passes.  Each pass saves the
  69 data it produces to files on disk, so that a) we don't hold huge
  70 amounts of state in memory, and b) the conversion process is
  71 resumable.
  72
  73 CollectRevsPass (formerly called pass1)
  74 ===============
  75
  76 The goal of this pass is to collect from the CVS files all of the data
  77 that will be required for the conversion.  If the --use-internal-co
  78 option was used, this pass also collects the file delta data; for
  79 --use-rcs or --use-cvs, the actual file contents are read again in
  80 OutputPass.
  81
  82 To collect this data, we walk over the repository, collecting data
  83 about the RCS files into an instance of CollectData.  Each RCS file is
  84 processed with rcsparse.parse(), which invokes callbacks from an
  85 instance of cvs2svn's _FileDataCollector class (which is a subclass of
  86 rcsparse.Sink).
  87
  88 While a file is being processed, all of the data for the file (except
  89 for contents and log messages) is held in memory.  When the file has
  90 been read completely, its data is converted into an instance of
  91 CVSFileItems, and this instance is manipulated a bit then pickled and
  92 stored to 'cvs-items.pck'.
  93
  94 For each RCS file, the first thing the parser encounters is the
  95 administrative header, including the head revision, the principal
  96 branch, symbolic names, RCS comments, etc.  The main thing that
  97 happens here is that _FileDataCollector.define_tag() is invoked on
  98 each symbolic name and its attached revision, so all the tags and
  99 branches of this file get collected.
 100
 101 Next, the parser hits the revision summary section.  That's the part
 102 of the RCS file that looks like this:
 103
 104    1.6
 105    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
 106    branches
 107         1.6.2.1;
 108    next 1.5;
 109
 110    1.5
 111    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
 112    branches;
 113    next 1.4;
 114
 115    [...]
 116
 117 For each revision summary, _FileDataCollector.define_revision() is
 118 invoked, recording that revision's metadata in various variables of
 119 the _FileDataCollector class instance.
 120
 121 Next, the parser encounters the *real* revision data, which has the
 122 log messages and file contents.  For each revision, it invokes
 123 _FileDataCollector.set_revision_info(), which sets some more fields in
 124 _RevisionData.  It also invokes RevisionRecorder.record_text(), which
 125 gives the RevisionRecorder the chance to record the file text if
 126 desired.  record_text() is allowed to return a token, which is carried
 127 along with the CVSRevision data and can be used by RevisionReader to
 128 retrieve the text in OutputPass.
 129
 130 When the parser is done with the file, _ProjectDataCollector takes the
 131 resulting CVSFileItems object and manipulates it to handle some CVS
 132 features:
 133
 134    - If the file had a vendor branch, make some adjustments to the
 135      file dependency graph to reflect implicit dependencies related to
 136      the vendor branch.  Also delete the 1.1 revision in the usual
 137      case that it doesn't contain any useful information.
 138
 139    - If the file was added on a branch rather than on trunk, then
 140      delete the "dead" 1.1 revision on trunk in the usual case that it
 141      doesn't contain any useful information.
 142
 143    - If the file was added on a branch after it already existed on
 144      trunk, then recent versions of CVS add an extra "dead" revision
 145      on the branch.  Remove this revision in the usual case that it
 146      doesn't contain any useful information, and sever the branch from
 147      trunk (since the branch version is independent of the trunk
 148      version).
 149
 150    - If the conversion was started with the --trunk-only option, then
 151
 152      1. graft any non-trunk default branch revisions onto trunk
 153         (because they affect the history of the default branch), and
 154
 155      2. delete all branches and tags and all remaining branch
 156         revisions.
 157
 158 Finally, the RevisionRecorder.finish_file() callback is called, the
 159 CVSFileItems instance is stored to a database, and statistics about
 160 how symbols were used in the file are recorded.
 161
 162 That's it -- the RCS file is done.
 163
 164 When every CVS file is done, CollectRevsPass is complete, and:
 165
 166    - The basic information about each file and directory (filename,
 167      path, etc) is written as a pickled CVSPath instance to
 168      'cvs-paths.db'.
 169
 170    - Information about each symbol seen, along with statistics like
 171      how often it was used as a branch or tag, is written as a pickled
 172      symbol_statistics._Stat object to 'symbol-statistics.pck'.  This
 173      includes the following information:
 174
 175          ID -- a unique positive identifying integer
 176
 177          NAME -- the symbol name
 178
 179          TAG_CREATE_COUNT -- the number of times the symbol was used
 180              as a tag
 181
 182          BRANCH_CREATE_COUNT -- the number of times the symbol was
 183              used as a branch
 184
 185          BRANCH_COMMIT_COUNT -- the number of files in which there was
 186              a commit on a branch with this name.
 187
 188          BRANCH_BLOCKERS -- the set of other symbols that ever
 189              sprouted from a branch with this name.  (A symbol cannot
 190              be excluded from the conversion unless all of its
 191              blockers are also excluded.)
 192
 193          POSSIBLE_PARENTS -- for each other branch, a count of the
 194              files in which it could have served as the symbol's source.
 195
 196      These data are used to look for inconsistencies in the use of
 197      symbols under CVS and to decide which symbols can be excluded or
 198      forced to be branches and/or tags.  The POSSIBLE_PARENTS data is
 199      used to pick the "optimum" parent from which the symbol should
 200      sprout in as many files as possible.
 201
 202      For a multiproject conversion, distinct symbol records (and IDs)
 203      are created for symbols in separate projects, even if they have
 204      the same name.  This is to prevent symbols in separate projects
 205      from being filled at the same time.
 206
 207    - Information about each CVS event is converted into a CVSItem
 208      instance and stored to 'cvs-items.pck'.  There are several types
 209      of CVSItems:
 210
 211          CVSRevision -- A specific revision of a specific CVS file.
 212
 213          CVSBranch -- The creation of a branch tag in a specific CVS
 214              file.
 215
 216          CVSTag -- The creation of a non-branch tag in a specific CVS
 217              file.
 218
 219      The CVSItems are grouped into CVSFileItems instances, one per
 220      CVSFile.  But a multi-file commit will still be scattered all
 221      over the place.
 222
 223    - Selected metadata for each CVS revision, including the author and
 224      log message, is written to 'metadata-index.dat' and
 225      'metadata.pck'.  The purpose is twofold: first, to save space by
 226      not having to save this information multiple times, and second
 227      because CVSRevisions that have the same metadata are candidates
 228      to be combined into an SVN changeset.
 229
 230      First, an SHA digest is created for each set of metadata.  The
 231      digest is constructed so that CVSRevisions that can be combined
 232      are all mapped to the same digest.  CVSRevisions that were part
 233      of a single CVS commit always have a common author and log
 234      message, therefore these fields are always included in the
 235      digest.  Moreover:
 236
 237      - if ctx.cross_project_commits is False, we avoid combining CVS
 238        revisions from separate projects by including the project.id in
 239        the digest.
 240
 241      - if ctx.cross_branch_commits is False, we avoid combining CVS
 242        revisions from different branches by including the branch name
 243        in the digest.
 244
 245      During the database creation phase, the database keeps track of a
 246      map
 247
 248        digest (20-byte string) -> metadata_id (int)
 249
 250      to allow the record for a set of metadata to be located
 251      efficiently.  As data are collected, it stores a map
 252
 253        metadata_id (int) -> (author, log_msg,) (tuple)
 254
 255      into the database for use in future passes.  CVSRevision records
 256      include the metadata_id.
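
The digest scheme described above can be sketched as follows.  This
is an illustration only, not the cvs2svn implementation: the class
name MetadataIndex, the way the fields are joined before hashing, and
the sequential ids are all assumptions.  SHA-1 is used because it
yields the 20-byte digest mentioned above.

```python
import hashlib


class MetadataIndex:
    """In-memory stand-in for the two maps described in the text."""

    def __init__(self, cross_project_commits=True,
                 cross_branch_commits=True):
        self.cross_project_commits = cross_project_commits
        self.cross_branch_commits = cross_branch_commits
        self._digest_to_id = {}  # digest (20 bytes) -> metadata_id
        self.records = {}        # metadata_id -> (author, log_msg)

    def get_id(self, author, log_msg, project_id, branch_name):
        # Author and log message are always part of the digest.
        parts = [author, log_msg]
        if not self.cross_project_commits:
            parts.append("project:%x" % project_id)
        if not self.cross_branch_commits:
            parts.append("branch:%s" % (branch_name or ""))
        # Assumption: fields are joined with NUL bytes before hashing.
        data = "\0".join(parts).encode("utf-8")
        digest = hashlib.sha1(data).digest()

        metadata_id = self._digest_to_id.get(digest)
        if metadata_id is None:
            metadata_id = len(self._digest_to_id) + 1
            self._digest_to_id[digest] = metadata_id
            self.records[metadata_id] = (author, log_msg)
        return metadata_id
```

Revisions that receive the same metadata_id are only candidates for
the same changeset; later passes still split them by time and by
dependencies.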
 257
 258 During this run, each CVSFile, Symbol, CVSItem, and metadata record is
 259 assigned an arbitrary unique ID that is used throughout the conversion
 260 to refer to it.
 261
 262
 263 CleanMetadataPass
 264 =================
 265
 266 Encode the cvs revision metadata as UTF-8, ensuring that all entries
 267 can be decoded using the chosen encodings.  Output the results to
 268 'metadata-clean-index.dat' and 'metadata-clean.pck'.
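
A minimal sketch of the re-encoding step, assuming that the chosen
encodings are tried in order and that the first one able to decode
the bytes wins.  The function name to_utf8 is hypothetical, and the
real pass may report failures differently.

```python
def to_utf8(raw, encodings):
    """Decode raw bytes with the first encoding that works and
    return the text re-encoded as UTF-8 bytes."""
    for encoding in encodings:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("cannot decode %r using %r" % (raw, encodings))
```

For example, to_utf8(b"caf\xe9", ["ascii", "latin-1"]) falls through
the ASCII attempt and succeeds with latin-1.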
 269
 270
 271 CollateSymbolsPass
 272 ==================
 273
 274 Use the symbol statistics collected in CollectRevsPass and any runtime
 275 options to determine which symbols should be treated as branches,
 276 which as tags, and which should be excluded from the conversion
 277 altogether.
 278
 279 Create 'symbols.pck', which contains a pickle of a list of TypedSymbol
 280 (Branch, Tag, or ExcludedSymbol) instances indicating how each symbol
 281 should be processed in the conversion.  The ID used for a TypedSymbol
 282 is the same as the ID allocated to the corresponding symbol in
 283 CollectRevsPass, so references in CVSItems do not have to be updated.
 284
 285
 286 FilterSymbolsPass
 287 =================
 288
 289 This pass works through the CVSFileItems instances stored in
 290 'cvs-items.pck', processing all of the items from each file as a
 291 group.  (This is the last pass in which all of the CVSItems for a file
 292 are in memory at once.)  It does the following things:
 293
 294    - Exclude any symbols that CollateSymbolsPass determined should be
 295      excluded, and any revisions on such branches.  Also delete
 296      references from other CVSItems to those that are being deleted.
 297
 298    - Transform any branches to tags or vice versa, also depending on
 299      the results of CollateSymbolsPass, and fix up the references from
 300      other CVSItems.
 301
 302    - Decide what line of development to use as the parent for each
 303      symbol in the file, and adjust the file's dependency tree
 304      accordingly.
 305
 306    - For each CVSRevision, record the list of symbols that the
 307      revision opens and closes.
 308
 309    - Write a summary of each surviving CVSRevision to
 310      'revs-summary.txt'.  Each line of the file has the format
 311
 312          METADATA_ID TIMESTAMP CVS_REVISION
 313
 314      where TIMESTAMP is a fixed-width timestamp, and CVS_REVISION is
 315      the pickled CVSRevision in a format that does not contain any
 316      newlines.  These summaries will be sorted in
 317      SortRevisionSummaryPass then used by InitializeChangesetsPass to
 318      create preliminary RevisionChangesets.
 319
 320    - Write a summary of CVSSymbols to 'symbols-summary.txt'.  Each
 321      line of the file has the format
 322
 323          SYMBOL_ID CVS_SYMBOL
 324
 325      where CVS_SYMBOL is the pickled CVSSymbol in a format that does
 326      not contain any newlines.  This information will be sorted by
 327      SYMBOL_ID in SortSymbolSummaryPass then used to create
 328      preliminary SymbolChangesets.
 329
 330
 331 SortRevisionSummaryPass
 332 =======================
 333
 334 Sort the revision summary written by FilterSymbolsPass, creating
 335 'revs-summary-s.txt'.  The sort groups items that might be added to
 336 the same changeset together and, within a group, sorts revisions by
 337 timestamp.  This step makes it easy for InitializeChangesetsPass to
 338 read the initial draft of RevisionChangesets straight from the file.
 339
 340
 341 SortSymbolSummaryPass
 342 =====================
 343
 344 Sort the symbol summary written by FilterSymbolsPass, creating
 345 'symbols-summary-s.txt'.  The sort groups together symbol items that
 346 might be added to the same changeset (though not in anything
 347 resembling chronological order).  The output of this pass is used by
 348 InitializeChangesetsPass.
 349
 350
 351 InitializeChangesetsPass
 352 ========================
 353
 354 This pass creates first-draft changesets, splitting them using
 355 COMMIT_THRESHOLD and breaking up any revision changesets that have
 356 internal dependencies.
 357
 358 The raw material for creating revision changesets is
 359 'revs-summary-s.txt', which already has CVSRevisions sorted in such a
 360 way that potential changesets are grouped together and sorted by date.
 361 The contents of this file are read line by line, and the corresponding
 362 CVSRevisions are accumulated into a changeset.  Whenever the
 363 metadata_id changes, or whenever there is a time gap of more than
 364 COMMIT_THRESHOLD (currently set to 5 minutes) between CVSRevisions,
 365 then a new changeset is started.
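
The grouping rule can be sketched like this.  The input is assumed to
be a sequence of (metadata_id, timestamp, item_id) tuples already
sorted by metadata_id and then by timestamp; the tuple layout and the
function name are illustrative, not the format cvs2svn uses.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds


def group_into_changesets(revisions):
    """Return lists of item ids, one list per draft changeset."""
    changesets = []
    current = []
    last_metadata_id = None
    last_timestamp = None
    for metadata_id, timestamp, item_id in revisions:
        # Start a new changeset when the metadata changes or the gap
        # to the previous revision exceeds COMMIT_THRESHOLD.
        if current and (
                metadata_id != last_metadata_id
                or timestamp - last_timestamp > COMMIT_THRESHOLD):
            changesets.append(current)
            current = []
        current.append(item_id)
        last_metadata_id = metadata_id
        last_timestamp = timestamp
    if current:
        changesets.append(current)
    return changesets
```

Note that the gap is measured between consecutive revisions, so a
long commit can span more than COMMIT_THRESHOLD in total as long as
no single gap exceeds it.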
 366
 367 At this point a revision changeset can have internal dependencies if
 368 two commits were made to the same file with the same log message
 369 within COMMIT_THRESHOLD of each other.  The next job of this pass is
 370 to split up changesets in such a way as to break these internal
 371 dependencies.  This is done by sorting the CVSRevisions within a
 372 changeset by timestamp, then choosing the split point that breaks the
 373 most internal dependencies.  This procedure is continued recursively
 374 until there are no more dependencies internal to a single changeset.
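
The recursive split can be sketched as follows.  This is a simplified
illustration under stated assumptions: items are (timestamp, item_id)
pairs, dependencies are (predecessor_id, successor_id) pairs, no item
depends on itself, and ties between equally good split points are
resolved by taking the earliest one.  None of these names come from
cvs2svn.

```python
def split_changeset(items, deps):
    """Split a draft changeset until no part contains both ends of
    a dependency.  Returns a list of lists of item ids."""
    items = sorted(items)
    position = {item_id: i for i, (_, item_id) in enumerate(items)}
    internal = [(position[p], position[s]) for p, s in deps
                if p in position and s in position and p != s]
    if not internal:
        return [[item_id for _, item_id in items]]

    def broken(i):
        # Dependencies whose two ends land on opposite sides of a
        # cut made just before index i.
        return sum(1 for a, b in internal
                   if min(a, b) < i <= max(a, b))

    best = max(range(1, len(items)), key=broken)
    return (split_changeset(items[:best], deps)
            + split_changeset(items[best:], deps))
```

Each cut breaks at least one internal dependency and produces two
strictly smaller parts, so the recursion terminates.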
 375
 376 Analogously, the CVSSymbol items from 'symbols-summary-s.txt' are
 377 grouped into symbol changesets.  (Symbol changesets cannot have
 378 internal dependencies, so there is no need to break them up at this
 379 stage.)
 380
 381 Finally, this pass writes a CVSItem database with the CVSItems written
 382 in order grouped by the preliminary changeset to which they belong.
 383 Even though the preliminary changesets still have to be split up to
 384 form final changesets, grouping the CVSItems this way improves the
 385 locality of disk accesses and thereby speeds up later passes.
 386
 387 The result of this pass is three databases:
 388
 389    - 'cvs-item-to-changeset.dat', which maps CVSItem ids to the id of
 390      the changeset containing the item,
 391
 392    - 'changesets.pck' and 'changesets-index.dat', which contain the
 393      changeset objects themselves, indexed by changeset id, and
 394
 395    - 'cvs-items-sorted-index.dat' and 'cvs-items-sorted.pck', which
 396      contain the pickled CVSItems ordered by changeset.
 397
 398
 399 BreakRevisionChangesetCyclesPass
 400 ================================
 401
 402 There can still be cycles in the dependency graph of
 403 RevisionChangesets caused by:
 404
 405    - Interleaved commits.  Since CVS commits are not atomic, it can
 406      happen that two commits are in progress at the same time and each
 407      alters the same two files, but in different orders.  These should
 408      be small cycles involving only a few revision changesets.  To
 409      resolve these cycles, one or more of the RevisionChangesets have
 410      to be split up (eventually becoming separate svn commits).
 411
 412    - Cycles involving a RevisionChangeset formed by the accidental
 413      combination of unrelated items within a short period of time that
 414      have the same author and log message.  These should also be small
 415      cycles involving only a few changesets.
 416
 417 The job of this pass is to break up such cycles (those involving only
 418 CVSRevisions).
 419
 420 This pass works by building up the graph of revision changesets and
 421 their dependencies in memory, then attempting a topological sort of
 422 the changesets.  Whenever the topological sort stalls, that implies
 423 the existence of a cycle, one of which can easily be determined.  This
 424 cycle is broken through the use of heuristics that try to determine an
 425 "efficient" way of splitting one or more of the changesets that are
 426 involved.
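
Why a stall makes a cycle easy to find: once the sort has removed
every changeset without unmet dependencies, each remaining changeset
still has at least one predecessor among the remaining ones, so
following predecessor links must eventually revisit a node.  A
minimal sketch, with a hypothetical function name and a plain dict as
the graph:

```python
def find_cycle(deps):
    """deps: dict node -> set of predecessor nodes, as left over
    after a stalled sort (every node has at least one predecessor
    that is itself a key of deps).

    Returns a list in which each node depends on the next one and
    the last node depends on the first.
    """
    node = next(iter(deps))
    seen = {}   # node -> index in path
    path = []
    while node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = next(iter(deps[node]))
    # Everything from the first visit of the repeated node onward
    # forms the cycle.
    return path[seen[node]:]
```

This finds some cycle, not necessarily the smallest one; choosing
which changeset in the cycle to split is left to the heuristics.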
 427
 428 The new RevisionChangesets are written to
 429 'cvs-item-to-changeset-revbroken.dat', 'changesets-revbroken.pck', and
 430 'changesets-revbroken-index.dat', along with the unmodified
 431 SymbolChangesets.  These files are in the same format as the analogous
 432 files produced by InitializeChangesetsPass.
 433
 434
 435 RevisionTopologicalSortPass
 436 ===========================
 437
 438 Topologically sort the RevisionChangesets, thereby picking the order
 439 in which the RevisionChangesets will be committed.  (Since the
 440 previous pass eliminated any dependency cycles, this sort is
 441 guaranteed to succeed.)  Ambiguities in the topological sort are
 442 resolved using the changesets' timestamps.  Then simplify the
 443 changeset graph into a linear chain by converting each
 444 RevisionChangeset into an OrderedChangeset that stores dependency
 445 links only to its commit-order predecessor and successor.  This
 446 simplified graph enforces the commit order that resulted from the
 447 topological sort, even after the SymbolChangesets are added back into
 448 the graph later.  Store the OrderedChangesets into
 449 'changesets-revsorted.pck' and 'changesets-revsorted-index.dat' along
 450 with the unmodified SymbolChangesets.
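
The conversion to a linear chain can be pictured with a small sketch.
The function name and the (predecessor, successor) tuple are
illustrative stand-ins for the OrderedChangeset links, not the real
class.

```python
def to_ordered_chain(sorted_ids):
    """Map each changeset id to (predecessor, successor) in commit
    order, using None at either end of the chain."""
    chain = {}
    for i, cid in enumerate(sorted_ids):
        prev_id = sorted_ids[i - 1] if i > 0 else None
        next_id = sorted_ids[i + 1] if i + 1 < len(sorted_ids) else None
        chain[cid] = (prev_id, next_id)
    return chain
```

Because each changeset keeps only these two links, the commit order
is fixed no matter which other changesets are added to the graph
afterwards.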
 451
 452
 453 BreakSymbolChangesetCyclesPass
 454 ==============================
 455
 456 It is possible for there to be cycles in the graph of SymbolChangesets
 457 caused by:
 458
 459    - Split creation of branches.  It is possible that branch A depends
 460      on branch B in one file, but B depends on A in another file.
 461      These cycles can be large, but they only involve
 462      SymbolChangesets.
 463
 464 Break up such dependency loops.  Output the results to
 465 'cvs-item-to-changeset-symbroken.dat',
 466 'changesets-symbroken-index.dat', and 'changesets-symbroken.pck'.
 467
 468
 469 BreakAllChangesetCyclesPass
 470 ===========================
 471
 472 The complete changeset graph (including both RevisionChangesets and
 473 BranchChangesets) can still have dependency cycles caused by:
 474
 475    - Split creation of branches.  The same branch tag can be added to
 476      different files at completely different times.  It is possible
 477      that the revision that was branched later depends on a
 478      RevisionChangeset that involves a file on the branch that was
 479      created earlier.  These cycles can be large, but they always
 480      involve a SymbolChangeset.  To resolve these cycles, the
 481      SymbolChangeset is split up into two changesets.
 482
 483 In fact, tag changesets do not have to be considered--CVSTags cannot
 484 participate in dependency cycles because no other CVSItem can depend
 485 on a CVSTag.
 486
 487 Since the input of this pass has been through
 488 RevisionTopologicalSortPass, all revision cycles have already been
 489 broken up and the order that the RevisionChangesets will be committed
 490 has been determined.  In this pass, the complete changeset graph is
 491 created in memory, including the linear list of OrderedChangesets from
 492 RevisionTopologicalSortPass plus all of the symbol changesets.
 493 Because this pass doesn't break up any OrderedChangesets, it is
 494 constrained to finding places within the revision changeset sequence
 495 in which the symbol changeset commits can be inserted.
 496
 497 The new changesets are written to
 498 'cvs-item-to-changeset-allbroken.dat', 'changesets-allbroken.pck', and
 499 'changesets-allbroken-index.dat', which are in the same format as the
 500 analogous files produced by InitializeChangesetsPass.
 501
 502
 503 TopologicalSortPass
 504 ===================
 505
 506 Now that the earlier passes have broken up any dependency cycles among
 507 the changesets, it is possible to order all of the changesets in such
 508 a way that all of a changeset's dependencies are committed before the
 509 changeset itself.  This pass does so by again building up the graph of
 510 changesets in memory, then at each step picking a changeset that has
 511 no remaining dependencies and removing it from the graph.  Whenever
 512 more than one dependency-free changeset is available, symbol
 513 changesets are chosen before revision changesets.  As changesets are
 514 processed, the timestamp sequence is kept monotonic by the
 515 simple expedient of adjusting retrograde timestamps to be later than
 516 their predecessor.  Timestamps that lie in the future, on the other
 517 hand, are assumed to be bogus and are adjusted backwards, also to be
 518 just later than their predecessor.
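
The timestamp adjustment can be sketched as follows.  The sketch
assumes integer timestamps in seconds, interprets "just later" as one
second later, and leaves a first timestamp untouched because it has
no predecessor; these details and the function name are assumptions.

```python
def fix_timestamps(timestamps, now):
    """Return a strictly increasing copy of timestamps, given in
    commit order.  `now` is the current time."""
    fixed = []
    previous = None
    for ts in timestamps:
        # Retrograde or future timestamps are both replaced by a
        # value just later than the predecessor.
        if previous is not None and (ts <= previous or ts > now):
            ts = previous + 1
        fixed.append(ts)
        previous = ts
    return fixed
```

With now=1000, the sequence [100, 90, 95, 5000, 200] becomes
[100, 101, 102, 103, 200].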
 519
 520 This pass writes a line to 'changesets-s.txt' for each
 521 RevisionChangeset, in the order that the changesets should be
 522 committed.  Each line contains
 523
 524     CHANGESET_ID TIMESTAMP
 525
 526 where CHANGESET_ID is the id of the changeset in the
 527 'changesets-allbroken' databases and TIMESTAMP is the timestamp that
 528 should be assigned to it when it is committed.  Both values are
 529 written in hexadecimal.
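
A sketch of writing and reading one such line.  Both fields are
hexadecimal as stated above; the absence of zero padding and the
helper names are assumptions.

```python
def format_changeset_line(changeset_id, timestamp):
    """Format one 'CHANGESET_ID TIMESTAMP' line, both in hex."""
    return "%x %x" % (changeset_id, timestamp)


def parse_changeset_line(line):
    """Inverse of format_changeset_line()."""
    changeset_id, timestamp = line.split()
    return int(changeset_id, 16), int(timestamp, 16)
```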
 530
 531
 532 CreateRevsPass (formerly called pass5)
 533 ==============
 534
 535 This pass generates SVNCommits from Changesets and records symbol
 536 openings and closings.  (One Changeset can result in multiple
 537 SVNCommits, for example if it causes symbols to be filled or copies to
 538 a vendor branch.)
 539
 540 This pass does the following:
 541
 542 1. Creates a database file to map Subversion revision numbers to
 543    SVNCommit instances ('svn-commits-index.dat' and
 544    'svn-commits.pck').  Creates another database file to map CVS
 545    Revisions to their Subversion Revision numbers
 546    ('cvs-revs-to-svn-revnums.db').
 547
 548 2. When a file is copied to a symbolic name in cvs2svn, it is copied
 549    from a specific source: either a CVSRevision, or a copy created by
 550    a previous CVSBranch of the file.  The copy has to be made from an
 551    SVN revision that falls within the lifetime of the source.  The SVN
 552    revision when the source was created is called the symbol's
 553    "opening", and the SVN revision when it was deleted or overwritten
 554    is called the symbol's "closing".  In this pass, the
 555    SymbolingsLogger class writes out a line to 'symbolic-names.txt'
 556    for each symbol opening or closing.  Note that some openings do not
 557    have closings, namely if the corresponding source is still present
 558    at the HEAD revision.
 559
 560    The format of each line is:
 561
 562        SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID
 563
 564    For example:
 565
 566        1c 234 O 1a7
 567        34 245 O 1a9
 568        18a 241 C 1a7
 569        122 201 O 1b3
 570
 571    Here is what the columns mean:
 572
 573    SYMBOL_ID -- The id of the branch or tag that has an opening or
 574        closing in this SVN_REVNUM, in hexadecimal.
 575
 576    SVN_REVNUM -- The Subversion revision number in which the opening
 577        or closing occurred.  (There can be multiple openings and
 578        closings per SVN_REVNUM).
 579
 580    TYPE -- "O" for openings and "C" for closings.
 581
 582    CVS_SYMBOL_ID -- The id of the CVSSymbol instance whose opening or
 583        closing is being described, in hexadecimal.
 584
 585    Each CVSSymbol that tags a non-dead file has exactly one opening
 586    and either zero or one closing.  The closing, if it exists, always
 587    occurs in a later SVN revision than the opening.
 588
 589    See SymbolingsLogger for more details.
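
A line in this format could be parsed as sketched below.  The two ids
are hexadecimal as stated above; treating SVN_REVNUM as decimal is an
assumption, since these notes do not specify its base.  The names
Symboling and parse_symboling are hypothetical.

```python
from collections import namedtuple

Symboling = namedtuple(
    "Symboling", "symbol_id svn_revnum type cvs_symbol_id")


def parse_symboling(line):
    """Parse one 'SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID' line."""
    symbol_id, svn_revnum, type_, cvs_symbol_id = line.split()
    if type_ not in ("O", "C"):
        raise ValueError("unknown TYPE %r" % type_)
    # Assumption: SVN_REVNUM is written in decimal.
    return Symboling(int(symbol_id, 16), int(svn_revnum),
                     type_, int(cvs_symbol_id, 16))
```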
 590
 591
 592 SortSymbolsPass (formerly called pass6)
 593 ===============
 594
 595 This pass sorts 'symbolic-names.txt' into 'symbolic-names-s.txt'.
 596 This orders the file first by symbol ID, and second by Subversion
 597 revision number, thus grouping all openings and closings for each
 598 symbolic name together.
 599
 600
 601 IndexSymbolsPass (formerly called pass7)
 602 ================
 603
 604 This pass iterates through all the lines in 'symbolic-names-s.txt',
 605 writing out a pickle file ('symbol-offsets.pck') mapping SYMBOL_ID to
 606 the file offset in 'symbolic-names-s.txt' where SYMBOL_ID is first
 607 encountered.  This will allow us to seek to the various offsets in the
 608 file and sequentially read only the openings and closings that we
 609 need.
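
The indexing step can be sketched as follows, using an in-memory
file in place of 'symbolic-names-s.txt'.  The function name is
hypothetical; the file is read in binary mode so that tell() and
seek() work on plain byte offsets.

```python
import io


def index_symbol_offsets(f):
    """f: binary file whose lines are grouped by SYMBOL_ID.
    Returns {symbol_id: offset of the first line for that symbol}."""
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        symbol_id = int(line.split()[0], 16)
        # Keep only the first offset seen for each symbol.
        offsets.setdefault(symbol_id, offset)
    return offsets


# Example: jump straight to the lines for symbol 0x34.
sample = io.BytesIO(b"1c 234 O 1a7\n1c 241 C 1a7\n34 245 O 1a9\n")
offsets = index_symbol_offsets(sample)
sample.seek(offsets[0x34])
first_line_for_34 = sample.readline()
```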
 610
 611
 612 OutputPass (formerly called pass8)
 613 ==========
 614
 615 This pass opens the svn-commits database and sequentially plays out
 616 all the commits to either a Subversion repository or to a dumpfile.
 617 It also decides what sources to use to fill symbols.
 618
 619 In --dumpfile mode, the result of this pass is a Subversion repository
 620 dumpfile (suitable for input to 'svnadmin load').  The dumpfile is the
 621 data's last static stage: last chance to check over the data, run it
 622 through svndumpfilter, move the dumpfile to another machine, etc.
 623
 624 When not in --dumpfile mode, no full dumpfile is created.  Instead,
 625 miniature dumpfiles, each representing a single revision, are created,
 626 loaded into the repository, and then removed.
 627
 628 In both modes, the dumpfile revisions are created by walking through
 629 'data.s-revs.txt'.
 630
 631 The database 'mirror-nodes.db' holds a skeletal mirror of the
 632 repository structure at each SVN revision.  This mirror keeps track of
 633 which files existed on each LOD, but does not record any file
 634 contents.  cvs2svn requires this information to decide which paths to
 635 copy when filling branches and tags.
 636
 637 When .cvsignore files are modified, cvs2svn computes the corresponding
 638 svn:ignore properties and applies the properties to the parent
 639 directory.  The .cvsignore files themselves are not included in the
 640 output unless the --keep-cvsignore option was specified.  But in
 641 either case, the .cvsignore files are recorded within the repository
 642 mirror as if they were being written to disk, to ensure that the
 643 containing directory is not pruned if the directory in CVS still
 644 contained a .cvsignore file.
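
The translation from .cvsignore contents to an svn:ignore value can
be sketched as follows.  The sketch assumes that .cvsignore holds
whitespace-separated patterns and that svn:ignore holds one pattern
per line; the function name is hypothetical, and any special entries
are passed through unchanged.

```python
def cvsignore_to_svn_ignore(text):
    """Convert the text of a .cvsignore file into an svn:ignore
    property value with one pattern per line."""
    patterns = text.split()
    return "".join(pattern + "\n" for pattern in patterns)
```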
 645
 646
 647                   ===============================
 648                       Branches and Tags Plan.
 649                   ===============================
 650
 651 OutputPass is also where tag and branch creation is done.  Since
 652 Subversion does tags and branches by copying from existing revisions
 653 (then maybe editing the copy, making subcopies underneath, etc), the
 654 big question for cvs2svn is how to achieve the minimum number of
 655 operations per creation.  For example, if it's possible to get the
 656 right tag by just copying revision 53, then it's better to do that
 657 than, say, copying revision 51 and then sub-copying in bits of
 658 revision 52 and 53.
 659
 660 Tags are created as soon as cvs2svn encounters the last CVS Revision
 661 that is a source for that tag.  The whole tag is created in one
 662 Subversion commit.
 663
 664 Branches are created as soon as all of their prerequisites are in
 665 place.  If a branch creation had to be broken up due to dependency
 666 cycles, then non-final parts are also created as soon as their
 667 prerequisites are ready.  In such a case, the SymbolChangeset
 668 specifies how much of the branch can be created in each step.
 669
 670 How just-in-time branch creation works:
 671
 672 In order to make the "best" set of copies/deletes when creating a
 673 branch, cvs2svn keeps track of two sets of trees while it's making
 674 commits:
 675
 676    1. A skeleton mirror of the subversion repository, that is, a
 677       record of which file existed on which LOD for each SVN revision.
 678
 679    2. A tree for each CVS symbolic name, and the svn file/directory
 680       revisions from which various parts of that tree could be copied.
 681
 682 Each LOD is recorded as a tree using the following schema: unique keys
 683 map to marshal.dumps() representations of dictionaries, which in turn
 684 map path component names to other unique keys:
 685
 686    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 687    entrykey1 ==> { entrynameX : entrykeyX, ... }
 688    entrykey2 ==> { entrynameY : entrykeyY, ... }
 689    entrykeyX ==> { etc, etc ...}
 690    entrykeyY ==> { etc, etc ...}
 691
 692 (The leaf nodes -- files -- are represented by None.)
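
The schema above can be tried out with a small sketch in which a
plain dict stands in for the on-disk database.  The key format, the
helper names, and the assumption that a path is never both a file and
a directory are illustrative only.

```python
import marshal

db = {}          # unique key -> marshal.dumps() of a directory dict
_counter = [0]   # source of unique keys


def new_directory():
    """Create an empty directory node and return its key."""
    key = "%x" % _counter[0]
    _counter[0] += 1
    db[key] = marshal.dumps({})
    return key


def add_file(root_key, path):
    """Record that the file at path (e.g. 'a/b/c.txt') exists."""
    components = path.split("/")
    key = root_key
    for name in components[:-1]:
        entries = marshal.loads(db[key])
        if name not in entries:
            entries[name] = new_directory()
            db[key] = marshal.dumps(entries)
        key = entries[name]
    entries = marshal.loads(db[key])
    entries[components[-1]] = None   # files are represented by None
    db[key] = marshal.dumps(entries)


def path_exists(root_key, path):
    """Return True if path names a file or directory in the tree."""
    components = path.split("/")
    key = root_key
    for i, name in enumerate(components):
        entries = marshal.loads(db[key])
        if name not in entries:
            return False
        key = entries[name]
        if key is None:
            # A file can only be the last component of a path.
            return i == len(components) - 1
    return True
```

Because each directory is stored under its own key, changing one
directory only requires rewriting that directory's entry and the
entries that point to it.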
 693
 694 The repository mirror allows cvs2svn to remember what paths exist in
 695 what revisions.
 696
 697 For details on how branches and tags are created, please see the
 698 docstring of the SymbolingsLogger class (and its methods).
 699
 700 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 701 - -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
 702 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
 703
 704 Some older notes and ideas about cvs2svn.  Not deleted, because they
 705 may contain suggestions for future improvements in design.
 706
 707 -----------------------------------------------------------------------
 708
 709 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
 710 considerations for the tool.
 711
 712 ------
 713 From: John Gardiner Myers <jgmyers@speakeasy.net>
 714 Subject: Thoughts on CVS to SVN conversion
 715 To: gstein@lyra.org
 716 Date: Sun, 15 Apr 2001 17:47:10 -0700
 717
 718 Some things you may want to consider for a CVS to SVN conversion utility:
 719
 720 If converting a CVS repository to SVN takes days, it would be good for
 721 the conversion utility to keep its progress state on disk.  If the
 722 conversion fails halfway through due to a network outage or power
 723 failure, that would allow the conversion to be resumed where it left off
 724 instead of having to start over from an empty SVN repository.
 725
 726 It is a short step from there to allowing periodic updates of a
 727 read-only SVN repository from a read/write CVS repository.  This allows
 728 the more relaxed conversion procedure:
 729
 730 1) Create SVN repository writable only by the conversion tool.
 731 2) Update SVN repository from CVS repository.
 732 3) Announce the time of CVS to SVN cutover.
 733 4) Repeat step (2) as needed.
 734 5) Disable commits to CVS repository, making it read-only.
 735 6) Repeat step (2).
 736 7) Enable commits to SVN repository.
 737 8) Wait for developers to move their workspaces to SVN.
 738 9) Decommission the CVS repository.
 739
 740 You may forward this message or parts of it as you see fit.
 741 ------
 742
 743 -----------------------------------------------------------------------
 744
 745 Further design thoughts from Greg Stein <gstein@lyra.org>
 746
 747 * timestamp the beginning of the process. ignore any commits that
 748   occur after that timestamp; otherwise, you could miss portions of a
 749   commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
 750   revision for items in B; we missed A)
 751
 752 * the above timestamp can also be used for John's "grab any updates
 753   that were missed in the previous pass."
 754
 755 * for each file processed, watch out for simultaneous commits. this
 756   may cause a problem during the reading/scanning/parsing of the file,
 757   or the parse succeeds but the results are garbage. this could be
 758   fixed with a CVS lock, but I'd prefer read-only access.
 759
 760   algorithm: get the mtime before opening the file. if an error occurs
 761   during reading, and the mtime has changed, then restart the file. if
 762   the read is successful, but the mtime changed, then restart the
 763   file.
 764
 765 * use a separate log to track unique branches and non-branched forks
 766   of revision history (Q: is it possible to create, say, 1.4.1.3
 767   without a "real" branch?). this log can then be used to create a
 768   /branches/ directory in the SVN repository.
 769
 770   Note: we want to determine some way to coalesce branches across
 771   files. It can't be based on name, though, since the same branch name
 772   could be used in multiple places, yet they are semantically
 773   different branches. Given files R, S, and T with branch B, we can
 774   tie those files' branch B into a "semantic group" whenever we see
 775   commit groups on a branch touching multiple files. Files that
 776   have a (named) branch but no commits on it are simply ignored. For
 777   each "semantic group" of a branch, we'd create a branch based on
 778   their common ancestor, then make the changes on the children as
 779   necessary. For single-file commits to a branch, we could use
 780   heuristics (pathname analysis) to add these to a group (and log what
 781   we did), or we could put them in a "reject" kind of file for a human
 782   to tell us what to do (the human would edit a config file of some
 783   kind to instruct the converter).
 784
 785 * if we have access to the CVSROOT/history, then we could process tags
 786   properly. otherwise, we can only use heuristics or configuration
 787   info to group up tags (branches can use commits; there are no
 788   commits associated with tags)
 789
 790 * ideally, we store every bit of data from the ,v files to enable a
 791   complete restoration of the CVS repository. this could be done by
 792   storing properties with CVS revision numbers and stuff (i.e. all
 793   metadata not already embodied by SVN would go into properties)
 794
 795 * how do we track the "states"? I presume "dead" is simply deleting
 796   the entry from SVN. what are the other legal states, and do we need
 797   to do anything with them?
 798
 799 * where do we put the "description"? how about locks, access list,
 800   keyword flags, etc.
 801
 802 * note that using something like the SourceForge repository will be an
 803   ideal test case. people *move* their repositories there, which means
 804   that all kinds of stuff can be found in those repositories, from
 805   wherever people used to run them, and under whatever development
 806   policies may have been used.
 807
 808   For example: I found one of the projects with a "permissions 644;"
 809   line in the "gnuplot" repository.  Most RCS releases issue warnings
 810   about that (although they properly handle/skip the lines), and CVS
 811   ignores RCS newphrases altogether.
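
The mtime-based restart algorithm from the "simultaneous commits"
item above could be sketched as follows.  This is only an
illustration under stated assumptions: read_stable() and its
max_attempts limit are hypothetical, and a real converter would
parse the ,v file inside the try block rather than just read its
bytes.

```python
import os


def read_stable(path, max_attempts=5):
    """Read path, restarting if its mtime changes during the read."""
    for _ in range(max_attempts):
        mtime_before = os.stat(path).st_mtime_ns
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            if os.stat(path).st_mtime_ns != mtime_before:
                continue    # file changed underneath us: restart
            raise           # unrelated error: report it
        if os.stat(path).st_mtime_ns == mtime_before:
            return data     # no concurrent modification seen
        # Read succeeded but the mtime changed: restart the file.
    raise RuntimeError(
        f"{path} kept changing after {max_attempts} attempts"
    )
```

This needs only read access to the repository, in keeping with the
stated preference for avoiding a CVS lock.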
 812
 813 # vim:tw=70