doc/design-notes.txt

   1                          How cvs2svn Works
   2                          =================
   3
   4                        Theory and requirements
   5                        ------ --- ------------
   6
   7 There are two main problems in converting a CVS repository to SVN:
   8
   9 - CVS does not record enough information to determine what actually
  10   happened to a repository.  For example, CVS does not record:
  11
  12   - Which file modifications were part of the same commit
  13
  14   - The timestamp of tag and branch creations
  15
  16   - Exactly which revision was the base of a branch (there is
  17     ambiguity between x.y, x.y.2.0, x.y.4.0, etc.)
  18
  19   - When the default branch was changed (for example, from a vendor
  20     branch back to trunk).
  21
  22 - The timestamps in a CVS archive are not reliable.  It can easily
  23   happen that timestamps are not even monotonic, and large errors (for
  24   example due to a failing server clock battery) are not unusual.
  25
  26 The absolutely crucial, sine qua non requirement of a conversion is
  27 that the dependency relationships within a file be honored, mainly:
  28
  29 - A revision depends on its predecessor
  30
  31 - A branch creation depends on the revision from which it branched,
  32   and commits on the branch depend on the branch creation
  33
  34 - A tag creation depends on the revision being tagged
  35
  36 These dependencies are reliably defined in the CVS repository, and
  37 they trump all others, so they are the scaffolding of the conversion.
  38
  39 Moreover, it is highly desirable that the timestamps of the SVN
  40 commits be monotonically increasing.
  41
  42 Within these constraints we also want the results of the conversion to
  43 resemble the history of the CVS repository as closely as possible.
  44 For example, the set of file changes grouped together in an SVN commit
  45 should be the same as the files changed within the corresponding CVS
  46 commit, insofar as that can be achieved in a manner that is consistent
  47 with the dependency requirements.  And the SVN commit timestamps
  48 should recreate the time of the CVS commit as far as possible without
  49 violating the monotonicity requirement.
  50
  51 The basic idea of the conversion is this: create the largest
  52 conceivable changesets, then split up changesets as necessary to break
  53 any cycles in the graph of changeset dependencies.  When all cycles
  54 have been removed, then do a topological sort of the changesets (with
  55 ambiguities resolved using CVS timestamps) to determine a
  56 self-consistent changeset commit order.
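
As an illustration only (a minimal sketch, not cvs2svn's actual code),
the final ordering step can be pictured as a topological sort in which
CVS timestamps break ties among the changesets that are ready to be
committed.  The function name topo_sort() and the dict-based graph
representation are assumptions made for this example.

```python
import heapq


def topo_sort(timestamps, deps):
    """Return changeset ids in a dependency-respecting commit order.

    timestamps: dict mapping changeset id -> timestamp
    deps: dict mapping changeset id -> set of ids it depends on
    (Both structures are hypothetical, chosen for this sketch.)
    """
    remaining = {cid: set(deps.get(cid, ())) for cid in timestamps}
    successors = {cid: [] for cid in timestamps}
    for cid, preds in remaining.items():
        for pred in preds:
            successors[pred].append(cid)

    # Heap of (timestamp, id): the earliest ready changeset is popped first.
    ready = [(timestamps[cid], cid)
             for cid, preds in remaining.items() if not preds]
    heapq.heapify(ready)

    order = []
    while ready:
        _, cid = heapq.heappop(ready)
        order.append(cid)
        for succ in successors[cid]:
            remaining[succ].discard(cid)
            if not remaining[succ]:
                heapq.heappush(ready, (timestamps[succ], succ))

    if len(order) != len(timestamps):
        # The sort stalled, so at least one cycle is left in the graph.
        raise ValueError("dependency cycle: a changeset must be split")
    return order
```

If the sort stalls, a cycle remains and some changeset has to be split
before trying again.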
  57
  58 The quality of the conversion (not in terms of correctness, but in
  59 terms of minimizing the number of svn commits) is mostly determined by
  60 the cleverness of the heuristics used to split up cycles.  And all of
  61 this has to be affordable, especially in terms of conversion time and
  62 RAM usage, for even the largest CVS repositories.
  63
  64
  65                             Implementation
  66                             --------------
  67
  68 A cvs2svn run consists of a number of passes.  Each pass saves the
  69 data it produces to files on disk, so that a) we don't hold huge
  70 amounts of state in memory, and b) the conversion process is
  71 resumable.
  72
  73 The intermediate files are referred to here by the symbolic constants
  74 holding their filenames in config.py.
  75
  76
  77 CollectRevsPass (formerly called pass1)
  78 ===============
  79
  80 The goal of this pass is to collect from the CVS files all of the data
  81 that will be required for the conversion.  If the --use-internal-co
  82 option was used, this pass also collects the file delta data; for
  83 --use-rcs or --use-cvs, the actual file contents are read again in
  84 OutputPass.
  85
  86 To collect this data, we walk over the repository, collecting data
  87 about the RCS files into an instance of CollectData.  Each RCS file is
  88 processed with rcsparse.parse(), which invokes callbacks from an
  89 instance of cvs2svn's _FileDataCollector class (which is a subclass of
  90 rcsparse.Sink).
  91
  92 While a file is being processed, all of the data for the file (except
  93 for contents and log messages) is held in memory.  When the file has
  94 been read completely, its data is converted into an instance of
  95 CVSFileItems, and this instance is manipulated a bit then pickled and
  96 stored to CVS_ITEMS_STORE.
  97
  98 For each RCS file, the first thing the parser encounters is the
  99 administrative header, including the head revision, the principal
 100 branch, symbolic names, RCS comments, etc.  The main thing that
 101 happens here is that _FileDataCollector.define_tag() is invoked on
 102 each symbolic name and its attached revision, so all the tags and
 103 branches of this file get collected.
 104
 105 Next, the parser hits the revision summary section.  That's the part
 106 of the RCS file that looks like this:
 107
 108    1.6
 109    date 2002.06.12.04.54.12;    author captnmark;       state Exp;
 110    branches
 111         1.6.2.1;
 112    next 1.5;
 113
 114    1.5
 115    date 2002.05.28.18.02.11;    author captnmark;       state Exp;
 116    branches;
 117    next 1.4;
 118
 119    [...]
 120
 121 For each revision summary, _FileDataCollector.define_revision() is
 122 invoked, recording that revision's metadata in various variables of
 123 the _FileDataCollector class instance.
 124
 125 Next, the parser encounters the *real* revision data, which has the
 126 log messages and file contents.  For each revision, it invokes
 127 _FileDataCollector.set_revision_info(), which sets some more fields in
 128 _RevisionData.  It also invokes RevisionRecorder.record_text(), which
 129 gives the RevisionRecorder the chance to record the file text if
 130 desired.  record_text() is allowed to return a token, which is carried
 131 along with the CVSRevision data and can be used by RevisionReader to
 132 retrieve the text in OutputPass.
 133
 134 When the parser is done with the file, _ProjectDataCollector takes the
 135 resulting CVSFileItems object and manipulates it to handle some CVS
 136 features:
 137
 138    - If the file had a vendor branch, make some adjustments to the
 139      file dependency graph to reflect implicit dependencies related to
 140      the vendor branch.  Also delete the 1.1 revision in the usual
 141      case that it doesn't contain any useful information.
 142
 143    - If the file was added on a branch rather than on trunk, then
 144      delete the "dead" 1.1 revision on trunk in the usual case that it
 145      doesn't contain any useful information.
 146
 147    - If the file was added on a branch after it already existed on
 148      trunk, then recent versions of CVS add an extra "dead" revision
 149      on the branch.  Remove this revision in the usual case that it
 150      doesn't contain any useful information, and sever the branch from
 151      trunk (since the branch version is independent of the trunk
 152      version).
 153
 154    - If the conversion was started with the --trunk-only option, then
 155
 156      1. graft any non-trunk default branch revisions onto trunk
 157         (because they affect the history of the default branch), and
 158
 159      2. delete all branches and tags and all remaining branch
 160         revisions.
 161
 162 Finally, the RevisionRecorder.finish_file() callback is called, the
 163 CVSFileItems instance is stored to a database, and statistics about
 164 how symbols were used in the file are recorded.
 165
 166 That's it -- the RCS file is done.
 167
 168 When every CVS file is done, CollectRevsPass is complete, and:
 169
 170    - The basic information about each project is stored to PROJECTS.
 171
 172    - The basic information about each file and directory (filename,
 173      path, etc) is written as a pickled CVSPath instance to
 174      CVS_PATHS_DB.
 175
 176    - Information about each symbol seen, along with statistics like
 177      how often it was used as a branch or tag, is written as a pickled
 178      symbol_statistics._Stat object to SYMBOL_STATISTICS.  This
 179      includes the following information:
 180
 181          ID -- a unique positive identifying integer
 182
 183          NAME -- the symbol name
 184
 185          TAG_CREATE_COUNT -- the number of times the symbol was used
 186              as a tag
 187
 188          BRANCH_CREATE_COUNT -- the number of times the symbol was
 189              used as a branch
 190
 191          BRANCH_COMMIT_COUNT -- the number of files in which there was
 192              a commit on a branch with this name.
 193
 194          BRANCH_BLOCKERS -- the set of other symbols that ever
 195              sprouted from a branch with this name.  (A symbol cannot
 196              be excluded from the conversion unless all of its
 197              blockers are also excluded.)
 198
 199          POSSIBLE_PARENTS -- for each other branch, a count of the
 200              files in which it could have served as the symbol's source.
 201
 202      These data are used to look for inconsistencies in the use of
 203      symbols under CVS and to decide which symbols can be excluded or
 204      forced to be branches and/or tags.  The POSSIBLE_PARENTS data is
 205      used to pick the "optimum" parent from which the symbol should
 206      sprout in as many files as possible.
 207
 208      For a multiproject conversion, distinct symbol records (and IDs)
 209      are created for symbols in separate projects, even if they have
 210      the same name.  This is to prevent symbols in separate projects
 211      from being filled at the same time.
 212
 213    - Information about each CVS event is converted into a CVSItem
 214      instance and stored to CVS_ITEMS_STORE.  There are several types
 215      of CVSItems:
 216
 217          CVSRevision -- A specific revision of a specific CVS file.
 218
 219          CVSBranch -- The creation of a branch tag in a specific CVS
 220              file.
 221
 222          CVSTag -- The creation of a non-branch tag in a specific CVS
 223              file.
 224
 225      The CVSItems are grouped into CVSFileItems instances, one per
 226      CVSFile.  But a multi-file commit will still be scattered all
 227      over the place.
 228
 229    - Selected metadata for each CVS revision, including the author and
 230      log message, is written to METADATA_INDEX_TABLE and
 231      METADATA_STORE.  The purpose is twofold: first, to save space by
 232      not having to save this information multiple times, and second
 233      because CVSRevisions that have the same metadata are candidates
 234      to be combined into an SVN changeset.
 235
 236      First, an SHA digest is created for each set of metadata.  The
 237      digest is constructed so that CVSRevisions that can be combined
 238      are all mapped to the same digest.  CVSRevisions that were part
 239      of a single CVS commit always have a common author and log
 240      message, therefore these fields are always included in the
 241      digest.  Moreover:
 242
 243      - if ctx.cross_project_commits is False, we avoid combining CVS
 244        revisions from separate projects by including the project.id in
 245        the digest.
 246
 247      - if ctx.cross_branch_commits is False, we avoid combining CVS
 248        revisions from different branches by including the branch name
 249        in the digest.
 250
 251      During the database creation phase, the database keeps track of a
 252      map
 253
 254        digest (20-byte string) -> metadata_id (int)
 255
 256      to allow the record for a set of metadata to be located
 257      efficiently.  As data are collected, it stores a map
 258
 259        metadata_id (int) -> (author, log_msg,) (tuple)
 260
 261      into the database for use in future passes.  CVSRevision records
 262      include the metadata_id.
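
The two maps described above can be sketched as follows.  This is a
hedged illustration, not the real database code: the names
metadata_digest() and MetadataIndex, the field labels, and the choice
of SHA-1 (inferred from the 20-byte digest size) are assumptions.

```python
import hashlib


def metadata_digest(author, log_msg, project_id=None, branch_name=None):
    """Return a 20-byte SHA-1 digest identifying one set of metadata.

    project_id / branch_name are passed only when commits may not be
    combined across projects / branches (hypothetical parameters).
    """
    fields = ['author:' + author, 'log:' + log_msg]
    if project_id is not None:
        fields.append('project:%x' % project_id)
    if branch_name is not None:
        fields.append('branch:' + branch_name)
    return hashlib.sha1('\0'.join(fields).encode('utf-8')).digest()


class MetadataIndex:
    """Map digest -> metadata_id and metadata_id -> (author, log_msg)."""

    def __init__(self):
        self._digest_to_id = {}
        self.records = {}

    def get_id(self, author, log_msg, **extra):
        digest = metadata_digest(author, log_msg, **extra)
        if digest not in self._digest_to_id:
            metadata_id = len(self.records) + 1
            self._digest_to_id[digest] = metadata_id
            self.records[metadata_id] = (author, log_msg)
        return self._digest_to_id[digest]
```

Revisions with the same author and log message receive the same
metadata_id, unless an extra field such as the branch name separates
them.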
 263
 264 During this run, each CVSFile, Symbol, CVSItem, and metadata record is
 265 assigned an arbitrary unique ID that is used throughout the conversion
 266 to refer to it.
 267
 268
 269 CleanMetadataPass
 270 =================
 271
 272 Encode the cvs revision metadata as UTF-8, ensuring that all entries
 273 can be decoded using the chosen encodings.  Output the results to
 274 METADATA_CLEAN_INDEX_TABLE and METADATA_CLEAN_STORE.
 275
 276
 277 CollateSymbolsPass
 278 ==================
 279
 280 Use the symbol statistics collected in CollectRevsPass and any runtime
 281 options to determine which symbols should be treated as branches,
 282 which as tags, and which should be excluded from the conversion
 283 altogether.
 284
 285 Create SYMBOL_DB, which contains a pickle of a list of TypedSymbol
 286 (Branch, Tag, or ExcludedSymbol) instances indicating how each symbol
 287 should be processed in the conversion.  The ID used for a TypedSymbol
 288 is the same as the ID allocated to the corresponding symbol in
 289 CollectRevsPass, so references in CVSItems do not have to be updated.
 290
 291
 292 FilterSymbolsPass
 293 =================
 294
 295 This pass works through the CVSFileItems instances stored in
 296 CVS_ITEMS_STORE, processing all of the items from each file as a
 297 group.  (This is the last pass in which all of the CVSItems for a file
 298 are in memory at once.)  It does the following things:
 299
 300    - Exclude any symbols that CollateSymbolsPass determined should be
 301      excluded, and any revisions on such branches.  Also delete
 302      references from other CVSItems to those that are being deleted.
 303
 304    - Transform any branches to tags or vice versa, also depending on
 305      the results of CollateSymbolsPass, and fix up the references from
 306      other CVSItems.
 307
 308    - Decide what line of development to use as the parent for each
 309      symbol in the file, and adjust the file's dependency tree
 310      accordingly.
 311
 312    - For each CVSRevision, record the list of symbols that the
 313      revision opens and closes.
 314
 315    - Write each surviving CVSRevision to CVS_REVS_DATAFILE.  Each line
 316      of the file has the format
 317
 318          METADATA_ID TIMESTAMP CVS_REVISION
 319
 320      where TIMESTAMP is a fixed-width timestamp, and CVS_REVISION is
 321      the pickled CVSRevision in a format that does not contain any
 322      newlines.  These summaries will be sorted in SortRevisionsPass
 323      then used by InitializeChangesetsPass to create preliminary
 324      RevisionChangesets.
 325
 326    - Write the CVSSymbols to CVS_SYMBOLS_DATAFILE.  Each line of the
 327      file has the format
 328
 329          SYMBOL_ID CVS_SYMBOL
 330
 331      where CVS_SYMBOL is the pickled CVSSymbol in a format that does
 332      not contain any newlines.  This information will be sorted by
 333      SYMBOL_ID in SortSymbolsPass then used to create preliminary
 334      SymbolChangesets.
 335
 336
 337 SortRevisionsPass
 338 =================
 339
 340 Sort CVS_REVS_DATAFILE (written by FilterSymbolsPass), creating
 341 CVS_REVS_SORTED_DATAFILE.  The sort groups items that might be added
 342 to the same changeset together and, within a group, sorts revisions by
 343 timestamp.  This step makes it easy for InitializeChangesetsPass to
 344 read the initial draft of RevisionChangesets straight from the file.
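
To see why fixed-width fields make this sort simple, consider the
following sketch.  It is not the actual file format: the eight-digit
hexadecimal fields, the base64 wrapper used to keep the pickle free of
newlines, and the function names are all assumptions for illustration.

```python
import base64
import pickle


def format_line(metadata_id, timestamp, cvs_rev):
    """Build one 'METADATA_ID TIMESTAMP CVS_REVISION' line (hypothetical)."""
    payload = base64.b64encode(pickle.dumps(cvs_rev)).decode('ascii')
    return '%08x %08x %s' % (metadata_id, timestamp, payload)


def parse_line(line):
    """Invert format_line()."""
    metadata_id, timestamp, payload = line.split(' ', 2)
    return (int(metadata_id, 16), int(timestamp, 16),
            pickle.loads(base64.b64decode(payload)))


lines = [
    format_line(2, 100, 'rev-b'),
    format_line(1, 300, 'rev-c'),
    format_line(1, 200, 'rev-a'),
]
# Because both leading fields have a fixed width, a plain text sort
# orders the lines by metadata id and then by timestamp.
sorted_records = [parse_line(line) for line in sorted(lines)]
```

With variable-width fields a text sort would put "10" before "9", which
is why the width matters.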
 345
 346
 347 SortSymbolsPass
 348 ===============
 349
 350 Sort CVS_SYMBOLS_DATAFILE (written by FilterSymbolsPass), creating
 351 CVS_SYMBOLS_SORTED_DATAFILE.  The sort groups together symbol items
 352 that might be added to the same changeset (though not in anything
 353 resembling chronological order).  The output of this pass is used by
 354 InitializeChangesetsPass.
 355
 356
 357 InitializeChangesetsPass
 358 ========================
 359
 360 This pass creates first-draft changesets, splitting them using
 361 COMMIT_THRESHOLD and breaking up any revision changesets that have
 362 internal dependencies.
 363
 364 The raw material for creating revision changesets is
 365 CVS_REVS_SORTED_DATAFILE, which already has CVSRevisions sorted in
 366 such a way that potential changesets are grouped together and sorted
 367 by date.  The contents of this file are read line by line, and the
 368 corresponding CVSRevisions are accumulated into a changeset.  Whenever
 369 the metadata_id changes, or whenever there is a time gap of more than
 370 COMMIT_THRESHOLD (currently set to 5 minutes) between CVSRevisions,
 371 then a new changeset is started.
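
The grouping rule can be sketched roughly like this.  The record
layout (metadata_id, timestamp, item) and the function name are
assumptions; only the rule itself comes from the description above.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds


def group_changesets(records):
    """Group sorted (metadata_id, timestamp, item) records into changesets.

    A new changeset starts whenever the metadata_id changes or the gap
    to the previous record exceeds COMMIT_THRESHOLD.
    """
    changesets = []
    current = []
    last_id = last_ts = None
    for metadata_id, timestamp, item in records:
        if current and (metadata_id != last_id
                        or timestamp - last_ts > COMMIT_THRESHOLD):
            changesets.append(current)
            current = []
        current.append(item)
        last_id, last_ts = metadata_id, timestamp
    if current:
        changesets.append(current)
    return changesets
```

Note that the gap is measured between consecutive records, so a long
run of closely spaced commits can span more than COMMIT_THRESHOLD in
total.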
 372
 373 At this point a revision changeset can have internal dependencies if
 374 two commits were made to the same file with the same log message
 375 within COMMIT_THRESHOLD of each other.  The next job of this pass is
 376 to split up changesets in such a way as to break such internal
 377 dependencies.  This is done by sorting the CVSRevisions within a
 378 changeset by timestamp, then choosing the split point that breaks the
 379 most internal dependencies.  This procedure is continued recursively
 380 until there are no more dependencies internal to a single changeset.
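
A minimal sketch of this recursive splitting, under the assumption
that items are (timestamp, item_id) pairs and dependencies are
(predecessor_id, successor_id) pairs (both representations are
hypothetical):

```python
def split_changeset(items, deps):
    """Split items into groups that have no internal dependencies.

    items: list of (timestamp, item_id)
    deps: set of (pred_id, succ_id) pairs
    Returns a list of lists of item ids.
    """
    items = sorted(items)
    ids = [item_id for _, item_id in items]
    position = {item_id: i for i, item_id in enumerate(ids)}
    internal = [(position[p], position[s]) for p, s in deps
                if p in position and s in position and p != s]
    if not internal:
        return [ids]

    def broken(k):
        # A split just before index k breaks every dependency whose two
        # ends fall on opposite sides of the split.
        return sum(1 for i, j in internal if min(i, j) < k <= max(i, j))

    best = max(range(1, len(items)), key=broken)
    return (split_changeset(items[:best], deps)
            + split_changeset(items[best:], deps))
```

Each split breaks at least one internal dependency and makes both
halves strictly smaller, so the recursion terminates.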
 381
 382 Analogously, the CVSSymbol items from CVS_SYMBOLS_SORTED_DATAFILE are
 383 grouped into symbol changesets.  (Symbol changesets cannot have
 384 internal dependencies, so there is no need to break them up at this
 385 stage.)
 386
 387 Finally, this pass writes a CVSItem database with the CVSItems written
 388 in order grouped by the preliminary changeset to which they belong.
 389 Even though the preliminary changesets still have to be split up to
 390 form final changesets, grouping the CVSItems this way improves the
 391 locality of disk accesses and thereby speeds up later passes.
 392
 393 The result of this pass is three databases:
 394
 395    - CVS_ITEM_TO_CHANGESET, which maps CVSItem ids to the id of the
 396      changeset containing the item,
 397
 398    - CHANGESETS_STORE and CHANGESETS_INDEX, which contain the
 399      changeset objects themselves, indexed by changeset id, and
 400
 401    - CVS_ITEMS_SORTED_STORE and CVS_ITEMS_SORTED_INDEX_TABLE, which
 402      contain the pickled CVSItems ordered by changeset.
 403
 404
 405 BreakRevisionChangesetCyclesPass
 406 ================================
 407
 408 There can still be cycles in the dependency graph of
 409 RevisionChangesets caused by:
 410
 411    - Interleaved commits.  Since CVS commits are not atomic, it can
 412      happen that two commits are in progress at the same time and each
 413      alters the same two files, but in different orders.  These should
 414      be small cycles involving only a few revision changesets.  To
 415      resolve these cycles, one or more of the RevisionChangesets have
 416      to be split up (eventually becoming separate svn commits).
 417
 418    - Cycles involving a RevisionChangeset formed by the accidental
 419      combination of unrelated items within a short period of time that
 420      have the same author and log message.  These should also be small
 421      cycles involving only a few changesets.
 422
 423 The job of this pass is to break up such cycles (those involving only
 424 CVSRevisions).
 425
 426 This pass works by building up the graph of revision changesets and
 427 their dependencies in memory, then attempting a topological sort of
 428 the changesets.  Whenever the topological sort stalls, that implies
 429 the existence of a cycle, one of which can easily be determined.  This
 430 cycle is broken through the use of heuristics that try to determine an
 431 "efficient" way of splitting one or more of the changesets that are
 432 involved.
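
One simple way to extract a cycle once the sort has stalled is to
follow predecessor links until a node repeats.  This is only a sketch
of that idea (the function name and the dict-of-sets input are
assumptions), and it says nothing about the splitting heuristics
themselves.

```python
def find_cycle(remaining):
    """Return a list of changeset ids that form a dependency cycle.

    remaining: dict mapping id -> non-empty set of unresolved
    predecessor ids, all of which are also keys of the dict.  That is
    exactly the situation when a topological sort has stalled.
    """
    node = next(iter(remaining))
    seen = {}   # id -> index in path
    path = []
    while node not in seen:
        seen[node] = len(path)
        path.append(node)
        # Any predecessor will do; min() just makes the walk deterministic.
        node = min(remaining[node])
    # Everything from the first visit of the repeated node onward is a cycle.
    return path[seen[node]:]
```

In the returned list each changeset depends on the one after it, and
the last depends on the first.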
 433
 434 The new RevisionChangesets are written to
 435 CVS_ITEM_TO_CHANGESET_REVBROKEN, CHANGESETS_REVBROKEN_STORE, and
 436 CHANGESETS_REVBROKEN_INDEX, along with the unmodified
 437 SymbolChangesets.  These files are in the same format as the analogous
 438 files produced by InitializeChangesetsPass.
 439
 440
 441 RevisionTopologicalSortPass
 442 ===========================
 443
 444 Topologically sort the RevisionChangesets, thereby picking the order
 445 in which the RevisionChangesets will be committed.  (Since the
 446 previous pass eliminated any dependency cycles, this sort is
 447 guaranteed to succeed.)  Ambiguities in the topological sort are
 448 resolved using the changesets' timestamps.  Then simplify the
 449 changeset graph into a linear chain by converting each
 450 RevisionChangeset into an OrderedChangeset that stores dependency
 451 links only to its commit-order predecessor and successor.  This
 452 simplified graph enforces the commit order that resulted from the
 453 topological sort, even after the SymbolChangesets are added back into
 454 the graph later.  Store the OrderedChangesets into
 455 CHANGESETS_REVSORTED_STORE and CHANGESETS_REVSORTED_INDEX along with
 456 the unmodified SymbolChangesets.
 457
 458
 459 BreakSymbolChangesetCyclesPass
 460 ==============================
 461
 462 It is possible for there to be cycles in the graph of SymbolChangesets
 463 caused by:
 464
 465    - Split creation of branches.  It is possible that branch A depends
 466      on branch B in one file, but B depends on A in another file.
 467      These cycles can be large, but they only involve
 468      SymbolChangesets.
 469
 470 Break up such dependency loops.  Output the results to
 471 CVS_ITEM_TO_CHANGESET_SYMBROKEN, CHANGESETS_SYMBROKEN_STORE, and
 472 CHANGESETS_SYMBROKEN_INDEX.
 473
 474
 475 BreakAllChangesetCyclesPass
 476 ===========================
 477
 478 The complete changeset graph (including both RevisionChangesets and
 479 BranchChangesets) can still have dependency cycles caused by:
 480
 481    - Split creation of branches.  The same branch tag can be added to
 482      different files at completely different times.  It is possible
 483      that the revision that was branched later depends on a
 484      RevisionChangeset that involves a file on the branch that was
 485      created earlier.  These cycles can be large, but they always
 486      involve a SymbolChangeset.  To resolve these cycles, the
 487      SymbolChangeset is split up into two changesets.
 488
 489 In fact, tag changesets do not have to be considered--CVSTags cannot
 490 participate in dependency cycles because no other CVSItem can depend
 491 on a CVSTag.
 492
 493 Since the input of this pass has been through
 494 RevisionTopologicalSortPass, all revision cycles have already been
 495 broken up and the order that the RevisionChangesets will be committed
 496 has been determined.  In this pass, the complete changeset graph is
 497 created in memory, including the linear list of OrderedChangesets from
 498 RevisionTopologicalSortPass plus all of the symbol changesets.
 499 Because this pass doesn't break up any OrderedChangesets, it is
 500 constrained to finding places within the revision changeset sequence
 501 in which the symbol changeset commits can be inserted.
 502
 503 The new changesets are written to CVS_ITEM_TO_CHANGESET_ALLBROKEN,
 504 CHANGESETS_ALLBROKEN_STORE, and CHANGESETS_ALLBROKEN_INDEX, which are
 505 in the same format as the analogous files produced by
 506 InitializeChangesetsPass.
 507
 508
 509 TopologicalSortPass
 510 ===================
 511
 512 Now that the earlier passes have broken up any dependency cycles among
 513 the changesets, it is possible to order all of the changesets in such
 514 a way that all of a changeset's dependencies are committed before the
 515 changeset itself.  This pass does so by again building up the graph of
 516 changesets in memory, then at each step picking a changeset that has
 517 no remaining dependencies and removing it from the graph.  Whenever
 518 more than one dependency-free changeset is available, symbol
 519 changesets are chosen before revision changesets.  As changesets are
 520 processed, the timestamp sequence is ensured to be monotonic by the
 521 simple expedient of adjusting retrograde timestamps to be later than
 522 their predecessor.  Timestamps that lie in the future, on the other
 523 hand, are assumed to be bogus and are adjusted backwards, also to be
 524 just later than their predecessor.
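
The timestamp adjustment can be sketched as below.  The function name,
the one-second increment, and the use of a caller-supplied "now" as the
cutoff for future timestamps are assumptions made for illustration.

```python
def assign_timestamps(timestamps, now):
    """Return strictly increasing timestamps for changesets in commit order.

    timestamps: original changeset timestamps, in commit order
    now: timestamps later than this are treated as bogus
    """
    result = []
    previous = None
    for ts in timestamps:
        # The first changeset has no predecessor, so it is kept as-is.
        if previous is not None and (ts <= previous or ts > now):
            # Retrograde or future timestamp: place it just after
            # its predecessor.
            ts = previous + 1
        previous = ts
        result.append(ts)
    return result
```

For example, with now=1000 the sequence 100, 90, 5000, 120 becomes
100, 101, 102, 120.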
 525
 526 This pass writes a line to CHANGESETS_SORTED_DATAFILE for each
 527 RevisionChangeset, in the order that the changesets should be
 528 committed.  Each line contains
 529
 530     CHANGESET_ID TIMESTAMP
 531
 532 where CHANGESET_ID is the id of the changeset in the
 533 CHANGESETS_ALLBROKEN_* databases and TIMESTAMP is the timestamp that
 534 should be assigned to it when it is committed.  Both values are
 535 written in hexadecimal.
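
Reading and writing such lines is straightforward.  A small sketch
(the helper names are hypothetical):

```python
def format_sorted_line(changeset_id, timestamp):
    """Format one 'CHANGESET_ID TIMESTAMP' line, both in hexadecimal."""
    return '%x %x\n' % (changeset_id, timestamp)


def parse_sorted_line(line):
    """Return (changeset_id, timestamp) as integers."""
    changeset_id, timestamp = line.split()
    return int(changeset_id, 16), int(timestamp, 16)
```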
 536
 537
 538 CreateRevsPass (formerly called pass5)
 539 ==============
 540
 541 This pass generates SVNCommits from Changesets and records symbol
 542 openings and closings.  (One Changeset can result in multiple
 543 SVNCommits, for example if it causes symbols to be filled or copies to
 544 a vendor branch.)
 545
 546 This pass does the following:
 547
 548 1. Creates a database file to map Subversion revision numbers to
 549    SVNCommit instances (SVN_COMMITS_STORE and
 550    SVN_COMMITS_INDEX_TABLE).  Creates another database file to map CVS
 551    Revisions to their Subversion Revision numbers
 552    (CVS_REVS_TO_SVN_REVNUMS).
 553
 554 2. When a file is copied to a symbolic name in cvs2svn, it is copied
 555    from a specific source: either a CVSRevision, or a copy created by
 556    a previous CVSBranch of the file.  The copy has to be made from an
 557    SVN revision that is during the lifetime of the source.  The SVN
 558    revision when the source was created is called the symbol's
 559    "opening", and the SVN revision when it was deleted or overwritten
 560    is called the symbol's "closing".  In this pass, the
 561    SymbolingsLogger class writes out a line to
 562    SYMBOL_OPENINGS_CLOSINGS for each symbol opening or closing.  Note
 563    that some openings do not have closings, namely if the
 564    corresponding source is still present at the HEAD revision.
 565
 566    The format of each line is:
 567
 568        SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID
 569
 570    For example:
 571
 572        1c 234 O 1a7
 573        34 245 O 1a9
 574        18a 241 C 1a7
 575        122 201 O 1b3
 576
 577    Here is what the columns mean:
 578
 579    SYMBOL_ID -- The id of the branch or tag that has an opening or
 580        closing in this SVN_REVNUM, in hexadecimal.
 581
 582    SVN_REVNUM -- The Subversion revision number in which the opening
 583        or closing occurred.  (There can be multiple openings and
 584        closings per SVN_REVNUM).
 585
 586    TYPE -- "O" for openings and "C" for closings.
 587
 588    CVS_SYMBOL_ID -- The id of the CVSSymbol instance whose opening or
 589        closing is being described, in hexadecimal.
 590
 591    Each CVSSymbol that tags a non-dead file has exactly one opening
 592    and either zero or one closing.  The closing, if it exists, always
 593    occurs in a later SVN revision than the opening.
 594
 595    See SymbolingsLogger for more details.
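
A line in this format could be parsed along the following lines.  This
is a sketch, not the SymbolingsLogger code: the Symboling tuple and the
function name are hypothetical, and SVN_REVNUM is assumed to be
decimal because the text only describes the two id columns as
hexadecimal.

```python
from collections import namedtuple

Symboling = namedtuple('Symboling',
                       'symbol_id svn_revnum type cvs_symbol_id')


def parse_symboling(line):
    """Parse one 'SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID' line."""
    symbol_id, svn_revnum, type_, cvs_symbol_id = line.split()
    if type_ not in ('O', 'C'):
        raise ValueError('unknown type %r' % type_)
    return Symboling(
        int(symbol_id, 16),      # hexadecimal
        int(svn_revnum),         # assumed decimal
        type_,
        int(cvs_symbol_id, 16),  # hexadecimal
    )
```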
 596
 597
 598 SortSymbolOpeningsClosingsPass (formerly called pass6)
 599 ==============================
 600
 601 This pass sorts SYMBOL_OPENINGS_CLOSINGS into
 602 SYMBOL_OPENINGS_CLOSINGS_SORTED.  This orders the file first by symbol
 603 ID, and second by Subversion revision number, thus grouping all
 604 openings and closings for each symbolic name together.
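
In memory, the same ordering could be expressed with a two-part sort
key.  This is a sketch only: an in-memory sort would not scale to a
large repository, and SVN_REVNUM is again assumed to be decimal.

```python
def sort_symbolings(lines):
    """Sort lines by symbol id (hexadecimal), then by SVN revision number."""
    def key(line):
        symbol_id, svn_revnum, _type, _cvs_symbol_id = line.split()
        return (int(symbol_id, 16), int(svn_revnum))
    return sorted(lines, key=key)
```

Comparing the ids numerically matters here, since a plain text sort
would place "122" before "1c".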
 605
 606
 607 IndexSymbolsPass (formerly called pass7)
 608 ================
 609
 610 This pass iterates through all the lines in
 611 SYMBOL_OPENINGS_CLOSINGS_SORTED, writing out a pickle file
 612 (SYMBOL_OFFSETS_DB) mapping SYMBOL_ID to the file offset in
 613 SYMBOL_OPENINGS_CLOSINGS_SORTED where SYMBOL_ID is first encountered.
 614 This will allow us to seek to the various offsets in the file and
 615 sequentially read only the openings and closings that we need.
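
The offset index and its use can be sketched like this.  The function
names are hypothetical, and the real pass pickles the resulting map to
SYMBOL_OFFSETS_DB rather than keeping it in memory.

```python
import io


def build_offsets(f):
    """Map each symbol id to the offset of its first line in f.

    f: binary file object containing the sorted openings/closings.
    """
    offsets = {}
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        symbol_id = int(line.split()[0], 16)
        offsets.setdefault(symbol_id, offset)
    return offsets


def read_symbol_lines(f, offsets, symbol_id):
    """Read only the lines that belong to symbol_id."""
    f.seek(offsets[symbol_id])
    lines = []
    for line in iter(f.readline, b''):
        if int(line.split()[0], 16) != symbol_id:
            break
        lines.append(line.decode('ascii').rstrip('\n'))
    return lines


demo = io.BytesIO(b'1c 201 O 1b0\n1c 234 O 1a7\n34 245 O 1a9\n')
demo_offsets = build_offsets(demo)
```

Because the file is sorted by symbol id, all lines for one symbol are
contiguous and reading can stop at the first line with a different id.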
 616
 617
 618 OutputPass (formerly called pass8)
 619 ==========
 620
 621 This pass opens the svn-commits database and sequentially plays out
 622 all the commits to either a Subversion repository or to a dumpfile.
 623 It also decides what sources to use to fill symbols.
 624
 625 In --dumpfile mode, the result of this pass is a Subversion repository
 626 dumpfile (suitable for input to 'svnadmin load').  The dumpfile is the
 627 data's last static stage: last chance to check over the data, run it
 628 through svndumpfilter, move the dumpfile to another machine, etc.
 629
 630 When not in --dumpfile mode, no full dumpfile is created.  Instead,
 631 miniature dumpfiles representing a single revision are created,
 632 loaded into the repository, and then removed.
 633
 634 In both modes, the dumpfile revisions are created by walking through
 635 the SVN_COMMITS_* database.
 636
 637 The database in MIRROR_NODES_STORE and MIRROR_NODES_INDEX_TABLE holds
 638 a skeletal mirror of the repository structure at each SVN revision.
 639 This mirror keeps track of which files existed on each LOD, but does
 640 not record any file contents.  cvs2svn requires this information to
 641 decide which paths to copy when filling branches and tags.
 642
 643 When .cvsignore files are modified, cvs2svn computes the corresponding
 644 svn:ignore properties and applies the properties to the parent
 645 directory.  The .cvsignore files themselves are not included in the
 646 output unless the --keep-cvsignore option was specified.  But in
 647 either case, the .cvsignore files are recorded within the repository
 648 mirror as if they were being written to disk, to ensure that the
 649 containing directory is not pruned if the directory in CVS still
 650 contained a .cvsignore file.
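
A rough sketch of the .cvsignore translation, assuming the usual CVS
conventions that patterns are separated by whitespace and that a lone
"!" clears the patterns seen so far.  The function name is
hypothetical and the real code may handle more cases.

```python
def cvsignore_to_svn_ignore(text):
    """Convert .cvsignore contents to an svn:ignore property value."""
    patterns = []
    for pattern in text.split():
        if pattern == '!':
            # '!' resets the ignore list (assumed CVS behaviour).
            patterns = []
        else:
            patterns.append(pattern)
    # svn:ignore holds one pattern per line.
    return ''.join(pattern + '\n' for pattern in patterns)
```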
 651
 652
 653                   ===============================
 654                       Branches and Tags Plan.
 655                   ===============================
 656
 657 This pass is also where tag and branch creation is done.  Since
 658 subversion does tags and branches by copying from existing revisions
 659 (then maybe editing the copy, making subcopies underneath, etc), the
 660 big question for cvs2svn is how to achieve the minimum number of
 661 operations per creation.  For example, if it's possible to get the
 662 right tag by just copying revision 53, then it's better to do that
 663 than, say, copying revision 51 and then sub-copying in bits of
 664 revision 52 and 53.
 665
 666 Tags are created as soon as cvs2svn encounters the last CVS Revision
 667 that is a source for that tag.  The whole tag is created in one
 668 Subversion commit.
 669
 670 Branches are created as soon as all of their prerequisites are in
 671 place.  If a branch creation had to be broken up due to dependency
 672 cycles, then non-final parts are also created as soon as their
 673 prerequisites are ready.  In such a case, the SymbolChangeset
 674 specifies how much of the branch can be created in each step.
 675
 676 How just-in-time branch creation works:
 677
 678 In order to make the "best" set of copies/deletes when creating a
 679 branch, cvs2svn keeps track of two sets of trees while it's making
 680 commits:
 681
 682    1. A skeleton mirror of the subversion repository, that is, a
 683       record of which file existed on which LOD for each SVN revision.
 684
 685    2. A tree for each CVS symbolic name, and the svn file/directory
 686       revisions from which various parts of that tree could be copied.
 687
 688 Each LOD is recorded as a tree using the following schema: unique keys
 689 map to marshal.dumps() representations of dictionaries, which in turn
 690 map path component names to other unique keys:
 691
 692    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
 693    entrykey1 ==> { entrynameX : entrykeyX, ... }
 694    entrykey2 ==> { entrynameY : entrykeyY, ... }
 695    entrykeyX ==> { etc, etc ...}
 696    entrykeyY ==> { etc, etc ...}
 697
 698 (The leaf nodes -- files -- are represented by None.)
 699
 700 The repository mirror allows cvs2svn to remember what paths exist in
 701 what revisions.
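
The schema above can be illustrated with an ordinary dict standing in
for the database.  This is a simplified sketch: the helper names and
the integer keys are assumptions, and it modifies nodes in place
instead of keeping a separate tree for every SVN revision.

```python
import marshal


def add_file(db, root_key, path):
    """Record that the file `path` exists in the tree rooted at root_key."""
    *dirs, filename = path.split('/')
    key = root_key
    for name in dirs:
        entries = marshal.loads(db[key])
        if entries.get(name) is None:
            # Create a new, empty directory node under a fresh key.
            child = len(db)
            db[child] = marshal.dumps({})
            entries[name] = child
            db[key] = marshal.dumps(entries)
        key = entries[name]
    entries = marshal.loads(db[key])
    entries[filename] = None  # leaf nodes (files) are represented by None
    db[key] = marshal.dumps(entries)


def path_exists(db, root_key, path):
    """Return True if `path` names a file or directory in the tree."""
    key = root_key
    for name in path.split('/'):
        if key is None:
            return False  # cannot descend into a file
        entries = marshal.loads(db[key])
        if name not in entries:
            return False
        key = entries[name]
    return True


mirror = {0: marshal.dumps({})}  # key 0 is the root of one LOD
add_file(mirror, 0, 'trunk/src/main.c')
```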
 702
 703 For details on how branches and tags are created, please see the
 704 docstring of the SymbolingsLogger class (and its methods).
 705
 706