Theory and requirements
------ --- ------------

There are two main problems in converting a CVS repository to SVN:

- CVS does not record enough information to determine what actually
  happened to a repository. For example, CVS does not record:

  - Which file modifications were part of the same commit

  - The timestamp of tag and branch creations

  - Exactly which revision was the base of a branch (there is
    ambiguity between x.y, x.y.2.0, x.y.4.0, etc.)

  - When the default branch was changed (for example, from a vendor
    branch back to trunk).

- The timestamps in a CVS archive are not reliable. It can easily
  happen that timestamps are not even monotonic, and large errors (for
  example due to a failing server clock battery) are not unusual.

The absolutely crucial, sine qua non requirement of a conversion is
that the dependency relationships within a file be honored, mainly:

- A revision depends on its predecessor

- A branch creation depends on the revision from which it branched,
  and commits on the branch depend on the branch creation

- A tag creation depends on the revision being tagged

These dependencies are reliably defined in the CVS repository, and
they trump all others, so they are the scaffolding of the conversion.

Moreover, it is highly desirable that the timestamps of the SVN
commits be monotonically increasing.

Within these constraints we also want the results of the conversion to
resemble the history of the CVS repository as closely as possible.
For example, the set of file changes grouped together in an SVN commit
should be the same as the files changed within the corresponding CVS
commit, insofar as that can be achieved in a manner that is consistent
with the dependency requirements. And the SVN commit timestamps
should recreate the time of the CVS commit as far as possible without
violating the monotonicity requirement.

The basic idea of the conversion is this: create the largest
conceivable changesets, then split up changesets as necessary to break
any cycles in the graph of changeset dependencies. When all cycles
have been removed, then do a topological sort of the changesets (with
ambiguities resolved using CVS timestamps) to determine a
self-consistent changeset commit order.
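
The final ordering step can be illustrated with a minimal sketch. It
is not the cvs2svn implementation; the function name and the dict-based
graph representation are assumptions made for the example. It sorts
changesets topologically and, whenever several changesets are ready at
once, picks the one with the earliest timestamp:

```python
import heapq


def topo_sort_by_timestamp(timestamps, deps):
    """Return changeset ids in an order that honors all dependencies.

    timestamps: {changeset_id: timestamp}
    deps: {changeset_id: set of ids that must be committed first}
    Raises ValueError if a dependency cycle remains.
    """
    remaining = {cid: set(deps.get(cid, ())) for cid in timestamps}
    successors = {cid: [] for cid in timestamps}
    for cid, preds in remaining.items():
        for pred in preds:
            successors[pred].append(cid)

    # Changesets with no unmet dependencies, ordered by timestamp.
    ready = [(timestamps[cid], cid)
             for cid, preds in remaining.items() if not preds]
    heapq.heapify(ready)

    order = []
    while ready:
        _, cid = heapq.heappop(ready)
        order.append(cid)
        for succ in successors[cid]:
            remaining[succ].discard(cid)
            if not remaining[succ]:
                heapq.heappush(ready, (timestamps[succ], succ))

    if len(order) != len(timestamps):
        raise ValueError('dependency cycle: a changeset must be split')
    return order
```

If the sort cannot place every changeset, a cycle is still present and
some changeset has to be split before trying again.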

The quality of the conversion (not in terms of correctness, but in
terms of minimizing the number of SVN commits) is mostly determined by
the cleverness of the heuristics used to split up cycles. And all of
this has to be affordable, especially in terms of conversion time and
RAM usage, for even the largest CVS repositories.

A cvs2svn run consists of a number of passes. Each pass saves the
data it produces to files on disk, so that a) we don't hold huge
amounts of state in memory, and b) the conversion process is
resumable.


CollectRevsPass (formerly called pass1)
=======================================

The goal of this pass is to collect from the CVS files all of the data
that will be required for the conversion. If the --use-internal-co
option was used, this pass also collects the file delta data; for
--use-rcs or --use-cvs, the actual file contents are read again in
OutputPass.

To collect this data, we walk over the repository, collecting data
about the RCS files into an instance of CollectData. Each RCS file is
processed with rcsparse.parse(), which invokes callbacks from an
instance of cvs2svn's _FileDataCollector class (which is a subclass of

While a file is being processed, all of the data for the file (except
for contents and log messages) is held in memory. When the file has
been read completely, its data is converted into an instance of
CVSFileItems, and this instance is manipulated a bit then pickled and
stored to 'cvs-items.pck'.

For each RCS file, the first thing the parser encounters is the
administrative header, including the head revision, the principal
branch, symbolic names, RCS comments, etc. The main thing that
happens here is that _FileDataCollector.define_tag() is invoked on
each symbolic name and its attached revision, so all the tags and
branches of this file get collected.

Next, the parser hits the revision summary section. That's the part
of the RCS file that looks like this:

    date 2002.06.12.04.54.12; author captnmark; state Exp;

    date 2002.05.28.18.02.11; author captnmark; state Exp;

For each revision summary, _FileDataCollector.define_revision() is
invoked, recording that revision's metadata in various variables of
the _FileDataCollector class instance.

Next, the parser encounters the *real* revision data, which has the
log messages and file contents. For each revision, it invokes
_FileDataCollector.set_revision_info(), which sets some more fields in
_RevisionData. It also invokes RevisionRecorder.record_text(), which
gives the RevisionRecorder the chance to record the file text if
desired. record_text() is allowed to return a token, which is carried
along with the CVSRevision data and can be used by RevisionReader to
retrieve the text in OutputPass.

When the parser is done with the file, _ProjectDataCollector takes the
resulting CVSFileItems object and manipulates it to handle some CVS

- If the file had a vendor branch, make some adjustments to the
  file dependency graph to reflect implicit dependencies related to
  the vendor branch. Also delete the 1.1 revision in the usual
  case that it doesn't contain any useful information.

- If the file was added on a branch rather than on trunk, then
  delete the "dead" 1.1 revision on trunk in the usual case that it
  doesn't contain any useful information.

- If the file was added on a branch after it already existed on
  trunk, then recent versions of CVS add an extra "dead" revision
  on the branch. Remove this revision in the usual case that it
  doesn't contain any useful information, and sever the branch from
  trunk (since the branch version is independent of the trunk
  version).

- If the conversion was started with the --trunk-only option, then

  1. graft any non-trunk default branch revisions onto trunk
     (because they affect the history of the default branch), and

  2. delete all branches and tags and all remaining branch
     revisions.

Finally, the RevisionRecorder.finish_file() callback is called, the
CVSFileItems instance is stored to a database, and statistics about
how symbols were used in the file are recorded.

That's it -- the RCS file is done.

When every CVS file is done, CollectRevsPass is complete, and:

- The basic information about each file (filename, path, etc) is
  written as a pickled CVSFile instance to 'cvs-files.db'.

- Information about each symbol seen, along with statistics like
  how often it was used as a branch or tag, is written as a pickled
  symbol_statistics._Stat object to 'symbol-statistics.pck'. This
  includes the following information:

      ID -- a unique positive identifying integer

      NAME -- the symbol name

      TAG_CREATE_COUNT -- the number of times the symbol was used
          as a tag

      BRANCH_CREATE_COUNT -- the number of times the symbol was
          used as a branch

      BRANCH_COMMIT_COUNT -- the number of files in which there was
          a commit on a branch with this name.

      BRANCH_BLOCKERS -- the set of other symbols that ever
          sprouted from a branch with this name. (A symbol cannot
          be excluded from the conversion unless all of its
          blockers are also excluded.)

      POSSIBLE_PARENTS -- for each other branch, a count of the
          files in which it could have served as the symbol's source.

  These data are used to look for inconsistencies in the use of
  symbols under CVS and to decide which symbols can be excluded or
  forced to be branches and/or tags. The POSSIBLE_PARENTS data is
  used to pick the "optimum" parent from which the symbol should
  sprout in as many files as possible.

  For a multiproject conversion, distinct symbol records (and IDs)
  are created for symbols in separate projects, even if they have
  the same name. This is to prevent symbols in separate projects
  from being filled at the same time.

- Information about each CVS event is converted into a CVSItem
  instance and stored to 'cvs-items.pck'. There are several types
  of CVSItem:

      CVSRevision -- A specific revision of a specific CVS file.

      CVSBranch -- The creation of a branch tag in a specific CVS
          file.

      CVSTag -- The creation of a non-branch tag in a specific CVS
          file.

  The CVSItems are grouped into CVSFileItems instances, one per
  CVSFile. But a multi-file commit will still be scattered all
  over the place.

- Selected metadata for each CVS revision, including the author and
  log message, is written to 'metadata-index.dat' and
  'metadata.pck'. The purpose is twofold: first, to save space by
  not having to save this information multiple times, and second
  because CVSRevisions that have the same metadata are candidates
  to be combined into an SVN changeset.

  First, an SHA digest is created for each set of metadata. The
  digest is constructed so that CVSRevisions that can be combined
  are all mapped to the same digest. CVSRevisions that were part
  of a single CVS commit always have a common author and log
  message, therefore these fields are always included in the
  digest.

  - if ctx.cross_project_commits is False, we avoid combining CVS
    revisions from separate projects by including the project.id in
    the digest.

  - if ctx.cross_branch_commits is False, we avoid combining CVS
    revisions from different branches by including the branch name
    in the digest.

  During the database creation phase, the database keeps track of a
  map

      digest (20-byte string) -> metadata_id (int)

  to allow the record for a set of metadata to be located
  efficiently. As data are collected, it stores a map

      metadata_id (int) -> (author, log_msg,) (tuple)

  into the database for use in future passes. CVSRevision records
  include the metadata_id.
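
The digest scheme for metadata can be sketched as follows. This is an
illustration only: the class name MetadataIndex, the method get_id(),
the use of SHA-1 (chosen because it yields a 20-byte digest) and the
way the fields are joined are assumptions of the example, not a
description of the actual cvs2svn code.

```python
import hashlib


class MetadataIndex:
    """Assign one metadata_id per distinct digest (hypothetical helper)."""

    def __init__(self, cross_project_commits=True, cross_branch_commits=True):
        self.cross_project_commits = cross_project_commits
        self.cross_branch_commits = cross_branch_commits
        self._digest_to_id = {}  # digest (20 bytes) -> metadata_id (int)
        self.records = {}        # metadata_id (int) -> (author, log_msg)

    def get_id(self, author, log_msg, project_id, branch_name):
        # Author and log message always take part in the digest.
        parts = [author, log_msg]
        if not self.cross_project_commits:
            # Keep revisions from separate projects apart.
            parts.append('project:%x' % project_id)
        if not self.cross_branch_commits:
            # Keep revisions from different branches apart
            # (None stands for trunk in this sketch).
            parts.append('branch:%s' % (branch_name or ''))
        digest = hashlib.sha1('\0'.join(parts).encode('utf-8')).digest()

        metadata_id = self._digest_to_id.get(digest)
        if metadata_id is None:
            metadata_id = len(self._digest_to_id) + 1
            self._digest_to_id[digest] = metadata_id
            self.records[metadata_id] = (author, log_msg)
        return metadata_id
```

Two revisions receive the same metadata_id exactly when every field
that went into the digest is equal, which is what makes them
candidates for the same changeset.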

During this run, each CVSFile, Symbol, CVSItem, and metadata record is
assigned an arbitrary unique ID that is used throughout the conversion.


CleanMetadataPass
=================

Encode the CVS revision metadata as UTF-8, ensuring that all entries
can be decoded using the chosen encodings. Output the results to
'metadata-clean-index.dat' and 'metadata-clean.pck'.


CollateSymbolsPass
==================

Use the symbol statistics collected in CollectRevsPass and any runtime
options to determine which symbols should be treated as branches,
which as tags, and which should be excluded from the conversion.

Create 'symbols.pck', which contains a pickle of a list of TypedSymbol
(Branch, Tag, or ExcludedSymbol) instances indicating how each symbol
should be processed in the conversion. The ID used for a TypedSymbol
is the same as the ID allocated to the corresponding symbol in
CollectRevsPass, so references in CVSItems do not have to be updated.


FilterSymbolsPass
=================

This pass works through the CVSFileItems instances stored in
'cvs-items.pck', processing all of the items from each file as a
group. (This is the last pass in which all of the CVSItems for a file
are in memory at once.) It does the following things:

- Exclude any symbols that CollateSymbolsPass determined should be
  excluded, and any revisions on such branches. Also delete
  references from other CVSItems to those that are being deleted.

- Transform any branches to tags or vice versa, also depending on
  the results of CollateSymbolsPass, and fix up the references from
  other CVSItems.

- Decide what line of development to use as the parent for each
  symbol in the file, and adjust the file's dependency tree
  accordingly.

- For each CVSRevision, record the list of symbols that the
  revision opens and closes.

- Write a summary of each surviving CVSRevision to
  'revs-summary.txt'. Each line of the file has the format

      METADATA_ID TIMESTAMP CVS_REVISION

  where TIMESTAMP is a fixed-width timestamp, and CVS_REVISION is
  the pickled CVSRevision in a format that does not contain any
  newlines. These summaries will be sorted in
  SortRevisionSummaryPass then used by InitializeChangesetsPass to
  create preliminary RevisionChangesets.

- Write a summary of CVSSymbols to 'symbols-summary.txt'. Each
  line of the file has the format

      SYMBOL_ID CVS_SYMBOL

  where CVS_SYMBOL is the pickled CVSSymbol in a format that does
  not contain any newlines. This information will be sorted by
  SYMBOL_ID in SortSymbolSummaryPass then used to create
  preliminary SymbolChangesets.


SortRevisionSummaryPass
=======================

Sort the revision summary written by FilterSymbolsPass, creating
'revs-summary-s.txt'. The sort groups items that might be added to
the same changeset together and, within a group, sorts revisions by
timestamp. This step makes it easy for InitializeChangesetsPass to
read the initial draft of RevisionChangesets straight from the file.


SortSymbolSummaryPass
=====================

Sort the symbol summary written by FilterSymbolsPass, creating
'symbols-summary-s.txt'. The sort groups together symbol items that
might be added to the same changeset (though not in anything
resembling chronological order). The output of this pass is used by
InitializeChangesetsPass.


InitializeChangesetsPass
========================

This pass creates first-draft changesets, splitting them using
COMMIT_THRESHOLD and breaking up any revision changesets that have
internal dependencies.

The raw material for creating revision changesets is
'revs-summary-s.txt', which already has CVSRevisions sorted in such a
way that potential changesets are grouped together and sorted by date.
The contents of this file are read line by line, and the corresponding
CVSRevisions are accumulated into a changeset. Whenever the
metadata_id changes, or whenever there is a time gap of more than
COMMIT_THRESHOLD (currently set to 5 minutes) between CVSRevisions,
then a new changeset is started.
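
The accumulation rule can be sketched in a few lines. The function
name and the (metadata_id, timestamp, rev) tuples are assumptions made
for illustration; the real pass reads pickled CVSRevisions from the
sorted summary file.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds


def group_into_changesets(sorted_revs, threshold=COMMIT_THRESHOLD):
    """Group revisions into first-draft changesets.

    sorted_revs: iterable of (metadata_id, timestamp, rev) tuples,
    already sorted by (metadata_id, timestamp).  Yields lists of revs.
    """
    current = []
    last_id = None
    last_time = None
    for metadata_id, timestamp, rev in sorted_revs:
        # Start a new changeset when the metadata changes or when the
        # gap to the previous revision exceeds the threshold.
        if current and (metadata_id != last_id
                        or timestamp - last_time > threshold):
            yield current
            current = []
        current.append(rev)
        last_id, last_time = metadata_id, timestamp
    if current:
        yield current
```

Note that the gap is measured between consecutive revisions, so a long
run of closely spaced revisions can produce a changeset that spans far
more than COMMIT_THRESHOLD in total.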

At this point a revision changeset can have internal dependencies if
two commits were made to the same file with the same log message
within COMMIT_THRESHOLD of each other. The next job of this pass is
to split up changesets in such a way as to break such internal
dependencies. This is done by sorting the CVSRevisions within a
changeset by timestamp, then choosing the split point that breaks the
most internal dependencies. This procedure is continued recursively
until there are no more dependencies internal to a single changeset.
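
A minimal sketch of this recursive splitting, under simplifying
assumptions: items are (timestamp, item_id) pairs, dependencies are
given as a set of (predecessor_id, successor_id) pairs, and ties
between equally good split points go to the earliest one. None of
these names come from cvs2svn itself.

```python
def split_changeset(items, deps):
    """Split a changeset until no part has an internal dependency.

    items: list of (timestamp, item_id)
    deps: set of (pred_id, succ_id) pairs, with pred_id != succ_id
    Returns a list of lists of item_ids.
    """
    items = sorted(items)
    pos = {item_id: i for i, (_, item_id) in enumerate(items)}
    # Dependencies whose two ends are both inside this changeset.
    internal = [(pos[p], pos[s]) for p, s in deps
                if p in pos and s in pos and p != s]
    if not internal:
        return [[item_id for _, item_id in items]]

    def broken(i):
        # Number of dependencies separated by a split before index i.
        return sum(1 for a, b in internal if min(a, b) < i <= max(a, b))

    best = max(range(1, len(items)), key=broken)
    return (split_changeset(items[:best], deps)
            + split_changeset(items[best:], deps))
```

Every internal dependency is broken by at least one split point, so
the best split always breaks something and the recursion terminates.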

Analogously, the CVSSymbol items from 'symbols-summary-s.txt' are
grouped into symbol changesets. (Symbol changesets cannot have
internal dependencies, so there is no need to break them up at this
stage.)

Finally, this pass writes a CVSItem database with the CVSItems written
in order grouped by the preliminary changeset to which they belong.
Even though the preliminary changesets still have to be split up to
form final changesets, grouping the CVSItems this way improves the
locality of disk accesses and thereby speeds up later passes.

The result of this pass is three databases:

- 'cvs-item-to-changeset.dat', which maps CVSItem ids to the id of
  the changeset containing the item,

- 'changesets.pck' and 'changesets-index.dat', which contain the
  changeset objects themselves, indexed by changeset id, and

- 'cvs-items-sorted-index.dat' and 'cvs-items-sorted.pck', which
  contain the pickled CVSItems ordered by changeset.


BreakRevisionChangesetCyclesPass
================================

There can still be cycles in the dependency graph of
RevisionChangesets caused by:

- Interleaved commits. Since CVS commits are not atomic, it can
  happen that two commits are in progress at the same time and each
  alters the same two files, but in different orders. These should
  be small cycles involving only a few revision changesets. To
  resolve these cycles, one or more of the RevisionChangesets have
  to be split up (eventually becoming separate SVN commits).

- Cycles involving a RevisionChangeset formed by the accidental
  combination of unrelated items within a short period of time that
  have the same author and log message. These should also be small
  cycles involving only a few changesets.

The job of this pass is to break up such cycles (those involving only
RevisionChangesets).

This pass works by building up the graph of revision changesets and
their dependencies in memory, then attempting a topological sort of
the changesets. Whenever the topological sort stalls, that implies
the existence of a cycle, one of which can easily be determined. This
cycle is broken through the use of heuristics that try to determine an
"efficient" way of splitting one or more of the changesets that are
involved.
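
Why a stalled sort makes a cycle easy to find can be shown with a
small sketch. The function below is hypothetical and assumes that
every node appears as a key of the dependency dict. Once no
dependency-free node is left, every remaining node has a remaining
predecessor, so following predecessors must eventually revisit a node.

```python
def find_cycle(deps):
    """Return a list of nodes forming a cycle, or None if there is none.

    deps: {node: set of predecessor nodes}; every node must be a key.
    """
    remaining = {n: set(p) for n, p in deps.items()}

    # Repeatedly remove nodes that have no unprocessed predecessors.
    progress = True
    while progress:
        free = [n for n, p in remaining.items() if not p]
        progress = bool(free)
        for n in free:
            del remaining[n]
        for p in remaining.values():
            p.difference_update(free)

    if not remaining:
        return None  # the sort ran to completion

    # The sort stalled: walk predecessor links until a node repeats.
    node = next(iter(remaining))
    seen = []
    while node not in seen:
        seen.append(node)
        node = next(iter(remaining[node]))
    return seen[seen.index(node):]
```

The sketch only locates a cycle. Deciding which changeset in it to
split, and where, is the job of the heuristics mentioned above.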

The new RevisionChangesets are written to
'cvs-item-to-changeset-revbroken.dat', 'changesets-revbroken.pck', and
'changesets-revbroken-index.dat', along with the unmodified
SymbolChangesets. These files are in the same format as the analogous
files produced by InitializeChangesetsPass.


RevisionTopologicalSortPass
===========================

Topologically sort the RevisionChangesets, thereby picking the order
in which the RevisionChangesets will be committed. (Since the
previous pass eliminated any dependency cycles, this sort is
guaranteed to succeed.) Ambiguities in the topological sort are
resolved using the changesets' timestamps. Then simplify the
changeset graph into a linear chain by converting each
RevisionChangeset into an OrderedChangeset that stores dependency
links only to its commit-order predecessor and successor. This
simplified graph enforces the commit order that resulted from the
topological sort, even after the SymbolChangesets are added back into
the graph later. Store the OrderedChangesets into
'changesets-revsorted.pck' and 'changesets-revsorted-index.dat' along
with the unmodified SymbolChangesets.


BreakSymbolChangesetCyclesPass
==============================

It is possible for there to be cycles in the graph of SymbolChangesets
caused by:

- Split creation of branches. It is possible that branch A depends
  on branch B in one file, but B depends on A in another file.
  These cycles can be large, but they only involve
  SymbolChangesets.

Break up such dependency loops. Output the results to
'cvs-item-to-changeset-symbroken.dat',
'changesets-symbroken-index.dat', and 'changesets-symbroken.pck'.


BreakAllChangesetCyclesPass
===========================

The complete changeset graph (including both RevisionChangesets and
BranchChangesets) can still have dependency cycles caused by:

- Split creation of branches. The same branch tag can be added to
  different files at completely different times. It is possible
  that the revision that was branched later depends on a
  RevisionChangeset that involves a file on the branch that was
  created earlier. These cycles can be large, but they always
  involve a SymbolChangeset. To resolve these cycles, the
  SymbolChangeset is split up into two changesets.

In fact, tag changesets do not have to be considered--CVSTags cannot
participate in dependency cycles because no other CVSItem can depend
on a CVSTag.

Since the input of this pass has been through
RevisionTopologicalSortPass, all revision cycles have already been
broken up and the order in which the RevisionChangesets will be
committed has been determined. In this pass, the complete changeset
graph is created in memory, including the linear list of
OrderedChangesets from RevisionTopologicalSortPass plus all of the
symbol changesets. Because this pass doesn't break up any
OrderedChangesets, it is constrained to finding places within the
revision changeset sequence in which the symbol changeset commits can
be inserted.

The new changesets are written to
'cvs-item-to-changeset-allbroken.dat', 'changesets-allbroken.pck', and
'changesets-allbroken-index.dat', which are in the same format as the
analogous files produced by InitializeChangesetsPass.


TopologicalSortPass
===================

Now that the earlier passes have broken up any dependency cycles among
the changesets, it is possible to order all of the changesets in such
a way that all of a changeset's dependencies are committed before the
changeset itself. This pass does so by again building up the graph of
changesets in memory, then at each step picking a changeset that has
no remaining dependencies and removing it from the graph. Whenever
more than one dependency-free changeset is available, symbol
changesets are chosen before revision changesets. As changesets are
processed, the timestamp sequence is kept monotonic by the simple
expedient of adjusting retrograde timestamps to be just later than
their predecessor. Timestamps that lie in the future, on the other
hand, are assumed to be bogus and are adjusted backwards, also to be
just later than their predecessor.
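
The timestamp adjustment can be sketched as below. The function name,
the one-second step, the treatment of equal timestamps as retrograde,
and the use of the current time as the cutoff for "the future" are all
assumptions of this example rather than documented cvs2svn behavior.

```python
import time


def monotonic_timestamps(timestamps, now=None):
    """Return a strictly increasing copy of a timestamp sequence.

    A timestamp that is not later than its (adjusted) predecessor, or
    that lies after 'now', is replaced by predecessor + 1.  The first
    timestamp has no predecessor and is left untouched.
    """
    now = time.time() if now is None else now
    result = []
    previous = None
    for ts in timestamps:
        if previous is not None and (ts <= previous or ts > now):
            ts = previous + 1
        result.append(ts)
        previous = ts
    return result
```

For example, with now=1000 the sequence 100, 90, 95, 10**12, 200
becomes 100, 101, 102, 103, 200: the two retrograde values and the
far-future value are each moved to just after their predecessor.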

This pass writes a line to 'changesets-s.txt' for each
RevisionChangeset, in the order that the changesets should be
committed. Each line contains

    CHANGESET_ID TIMESTAMP

where CHANGESET_ID is the id of the changeset in the
'changesets-allbroken' databases and TIMESTAMP is the timestamp that
should be assigned to it when it is committed. Both values are
written in hexadecimal.
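
Reading and writing such a line is straightforward. The helper names
below are hypothetical, and the sketch assumes plain unpadded
hexadecimal fields separated by a single space, which the text above
does not specify.

```python
def format_line(changeset_id, timestamp):
    """Format one 'CHANGESET_ID TIMESTAMP' line, both in hexadecimal."""
    return '%x %x\n' % (changeset_id, timestamp)


def parse_line(line):
    """Parse one line back into (changeset_id, timestamp) integers."""
    changeset_id, timestamp = line.split()
    return int(changeset_id, 16), int(timestamp, 16)
```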


CreateRevsPass (formerly called pass5)
======================================

This pass generates SVNCommits from Changesets and records symbol
openings and closings. (One Changeset can result in multiple
SVNCommits, for example if it causes symbols to be filled or copies to

This pass does the following:

1. Creates a database file to map Subversion revision numbers to
   SVNCommit instances ('svn-commits-index.dat' and
   'svn-commits.pck'). Creates another database file to map CVS
   Revisions to their Subversion Revision numbers
   ('cvs-revs-to-svn-revnums.db').

2. When a file is copied to a symbolic name in cvs2svn, it is copied
   from a specific source: either a CVSRevision, or a copy created by
   a previous CVSBranch of the file. The copy has to be made from an
   SVN revision that is during the lifetime of the source. The SVN
   revision when the source was created is called the symbol's
   "opening", and the SVN revision when it was deleted or overwritten
   is called the symbol's "closing". In this pass, the
   SymbolingsLogger class writes out a line to 'symbolic-names.txt'
   for each symbol opening or closing. Note that some openings do not
   have closings, namely if the corresponding source is still present
   at the HEAD revision.

   The format of each line is:

       SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID

   Here is what the columns mean:

   SYMBOL_ID -- The id of the branch or tag that has an opening in
       this SVN_REVNUM, in hexadecimal.

   SVN_REVNUM -- The Subversion revision number in which the opening
       or closing occurred. (There can be multiple openings and
       closings per SVN_REVNUM).

   TYPE -- "O" for openings and "C" for closings.

   CVS_SYMBOL_ID -- The id of the CVSSymbol instance whose opening or
       closing is being described, in hexadecimal.

   Each CVSSymbol that tags a non-dead file has exactly one opening
   and either zero or one closing. The closing, if it exists, always
   occurs in a later SVN revision than the opening.

   See SymbolingsLogger for more details.


SortSymbolsPass (formerly called pass6)
=======================================

This pass sorts 'symbolic-names.txt' into 'symbolic-names-s.txt'.
This orders the file first by symbol ID, and second by Subversion
revision number, thus grouping all openings and closings for each
symbolic name together.


IndexSymbolsPass (formerly called pass7)
========================================

This pass iterates through all the lines in 'symbolic-names-s.txt',
writing out a pickle file ('symbol-offsets.pck') mapping SYMBOL_ID to
the file offset in 'symbolic-names-s.txt' where SYMBOL_ID is first
encountered. This will allow us to seek to the various offsets in the
file and sequentially read only the openings and closings that we
need.
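
The offset index can be sketched as follows. The function names are
hypothetical, and the example assumes that SVN_REVNUM is written in
decimal, which the format description above leaves open. The file is
handled in binary mode so that tell() and seek() work on plain byte
offsets.

```python
def index_symbol_offsets(f):
    """Map each SYMBOL_ID to the offset of its first line.

    f: binary file of 'SYMBOL_ID SVN_REVNUM TYPE CVS_SYMBOL_ID' lines,
    sorted so that all lines for one symbol are adjacent.
    """
    offsets = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        symbol_id = int(line.split()[0], 16)
        offsets.setdefault(symbol_id, offset)
    return offsets


def read_symbol_events(f, offsets, symbol_id):
    """Yield (svn_revnum, type, cvs_symbol_id) for a single symbol."""
    f.seek(offsets[symbol_id])
    for line in iter(f.readline, b''):
        sid, revnum, kind, cvs_symbol_id = line.split()
        if int(sid, 16) != symbol_id:
            break  # reached the next symbol's block
        yield int(revnum), kind.decode('ascii'), int(cvs_symbol_id, 16)
```

Because the lines for one symbol are contiguous, a reader can stop at
the first line that belongs to a different symbol instead of scanning
the rest of the file.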


OutputPass (formerly called pass8)
==================================

This pass opens the svn-commits database and sequentially plays out
all the commits to either a Subversion repository or to a dumpfile.
It also decides what sources to use to fill symbols.

In --dumpfile mode, the result of this pass is a Subversion repository
dumpfile (suitable for input to 'svnadmin load'). The dumpfile is the
data's last static stage: last chance to check over the data, run it
through svndumpfilter, move the dumpfile to another machine, etc.

When not in --dumpfile mode, no full dumpfile is created. Instead,
miniature dumpfiles representing a single revision are created,
loaded into the repository, and then removed.

In both modes, the dumpfile revisions are created by walking through

The database 'mirror-nodes.db' holds a skeletal mirror of the
repository structure at each SVN revision. This mirror keeps track of
which files existed on each LOD, but does not record any file
contents. cvs2svn requires this information to decide which paths to
copy when filling branches and tags.

When .cvsignore files are modified, cvs2svn computes the corresponding
svn:ignore properties and applies the properties to the parent
directory. The .cvsignore files themselves are not included in the
output unless the --keep-cvsignore option was specified. But in
either case, the .cvsignore files are recorded within the repository
mirror as if they were being written to disk, to ensure that the
containing directory is not pruned if the directory in CVS still
contained a .cvsignore file.


===============================
    Branches and Tags Plan.
===============================

This pass is also where tag and branch creation is done. Since
Subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation. For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of

Tags are created as soon as cvs2svn encounters the last CVS Revision
that is a source for that tag. The whole tag is created in one
commit.

Branches are created as soon as all of their prerequisites are in
place. If a branch creation had to be broken up due to dependency
cycles, then non-final parts are also created as soon as their
prerequisites are ready. In such a case, the SymbolChangeset
specifies how much of the branch can be created in each step.

How just-in-time branch creation works:

In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:

1. A skeleton mirror of the Subversion repository, that is, a
   record of which file existed on which LOD for each SVN revision.

2. A tree for each CVS symbolic name, and the SVN file/directory
   revisions from which various parts of that tree could be copied.

Each LOD is recorded as a tree using the following schema: unique keys
map to marshal.dumps() representations of dictionaries, which in turn
map path component names to other unique keys:

    root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
    entrykey1 ==> { entrynameX : entrykeyX, ... }
    entrykey2 ==> { entrynameY : entrykeyY, ... }
    entrykeyX ==> { etc, etc ...}
    entrykeyY ==> { etc, etc ...}

(The leaf nodes -- files -- are represented by None.)
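
The schema can be illustrated with a small in-memory sketch. The
TreeStore class, its method names and the hexadecimal key format are
assumptions of the example; an ordinary dict stands in for the on-disk
database, and the sketch does not guard against a path component being
used both as a file and as a directory.

```python
import marshal


class TreeStore:
    """Toy store: unique keys map to marshal.dumps() of directory dicts."""

    def __init__(self):
        self._db = {}       # key -> marshal.dumps({name: child key or None})
        self._next_key = 0

    def new_node(self):
        """Create an empty directory node and return its key."""
        key = '%x' % self._next_key
        self._next_key += 1
        self._db[key] = marshal.dumps({})
        return key

    def _load(self, key):
        return marshal.loads(self._db[key])

    def add_file(self, root_key, path):
        """Record a file, creating intermediate directories as needed."""
        *dirs, name = path.split('/')
        key = root_key
        for component in dirs:
            entries = self._load(key)
            child = entries.get(component)
            if child is None:
                child = self.new_node()
                entries[component] = child
                self._db[key] = marshal.dumps(entries)
            key = child
        entries = self._load(key)
        entries[name] = None  # leaf nodes (files) are represented by None
        self._db[key] = marshal.dumps(entries)

    def exists(self, root_key, path):
        """Return True if the path names an existing file or directory."""
        entries = self._load(root_key)
        components = path.split('/')
        for component in components[:-1]:
            key = entries.get(component)
            if key is None:
                return False
            entries = self._load(key)
        return components[-1] in entries
```

Because each directory is stored under its own key, recording a change
only rewrites the directory dictionaries that are actually touched.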

The repository mirror allows cvs2svn to remember what paths exist in

For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).

-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

Some older notes and ideas about cvs2svn. Not deleted, because they
may contain suggestions for future improvements in design.

-----------------------------------------------------------------------

An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.

From: John Gardiner Myers <jgmyers@speakeasy.net>
Subject: Thoughts on CVS to SVN conversion
Date: Sun, 15 Apr 2001 17:47:10 -0700

Some things you may want to consider for a CVS to SVN conversion utility:

If converting a CVS repository to SVN takes days, it would be good for
the conversion utility to keep its progress state on disk. If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.

It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository. This allows
the more relaxed conversion procedure:

1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decomission the CVS repository.

You may forward this message or parts of it as you seem fit.

-----------------------------------------------------------------------

Further design thoughts from Greg Stein <gstein@lyra.org>

* timestamp the beginning of the process. ignore any commits that
  occur after that timestamp; otherwise, you could miss portions of a
  commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
  revision for items in B; we missed A)

* the above timestamp can also be used for John's "grab any updates
  that were missed in the previous pass."

* for each file processed, watch out for simultaneous commits. this
  may cause a problem during the reading/scanning/parsing of the file,
  or the parse succeeds but the results are garbaged. this could be
  fixed with a CVS lock, but I'd prefer read-only access.

  algorithm: get the mtime before opening the file. if an error occurs
  during reading, and the mtime has changed, then restart the file. if
  the read is successful, but the mtime changed, then restart the
  file.
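
  The mtime check described above could look roughly like the sketch
  below. The function name, the parse callback and the retry limit are
  hypothetical; this illustrates the idea in the note and is not code
  from cvs2svn.

```python
import os


def read_stable(path, parse, max_attempts=5):
    """Read and parse a file, restarting if it changes underneath us.

    parse: callable that takes the file's bytes and returns a result.
    """
    for _ in range(max_attempts):
        mtime = os.stat(path).st_mtime_ns
        try:
            with open(path, 'rb') as f:
                result = parse(f.read())
        except Exception:
            if os.stat(path).st_mtime_ns != mtime:
                continue  # file changed while reading: restart the file
            raise         # unchanged file: the error is genuine
        if os.stat(path).st_mtime_ns == mtime:
            return result
        # Read succeeded but the file changed meanwhile: restart the file.
    raise RuntimeError('%s kept changing during read' % path)
```

  A check based on mtime can still miss a change that lands within
  the timestamp resolution of the filesystem, so this narrows the
  window for races but does not replace a lock.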

* use a separate log to track unique branches and non-branched forks
  of revision history (Q: is it possible to create, say, 1.4.1.3
  without a "real" branch?). this log can then be used to create a
  /branches/ directory in the SVN repository.

  Note: we want to determine some way to coalesce branches across
  files. It can't be based on name, though, since the same branch name
  could be used in multiple places, yet they are semantically
  different branches. Given files R, S, and T with branch B, we can
  tie those files' branch B into a "semantic group" whenever we see
  commit groups on a branch touching multiple files. Files that
  have a (named) branch but no commits on it are simply ignored. For
  each "semantic group" of a branch, we'd create a branch based on
  their common ancestor, then make the changes on the children as
  necessary. For single-file commits to a branch, we could use
  heuristics (pathname analysis) to add these to a group (and log what
  we did), or we could put them in a "reject" kind of file for a human
  to tell us what to do (the human would edit a config file of some
  kind to instruct the converter).

* if we have access to the CVSROOT/history, then we could process tags
  properly. otherwise, we can only use heuristics or configuration
  info to group up tags (branches can use commits; there are no
  commits associated with tags)

* ideally, we store every bit of data from the ,v files to enable a
  complete restoration of the CVS repository. this could be done by
  storing properties with CVS revision numbers and stuff (i.e. all
  metadata not already embodied by SVN would go into properties)

* how do we track the "states"? I presume "dead" is simply deleting
  the entry from SVN. what are the other legal states, and do we need
  to do anything with them?

* where do we put the "description"? how about locks, access list,

* note that using something like the SourceForge repository will be an
  ideal test case. people *move* their repositories there, which means
  that all kinds of stuff can be found in those repositories, from
  wherever people used to run them, and under whatever development
  policies may have been used.

  For example: I found one of the projects with a "permissions 644;"
  line in the "gnuplot" repository. Most RCS releases issue warnings
  about that (although they properly handle/skip the lines), and CVS
  ignores RCS newphrases altogether.