4 A cvs2svn run consists of eight passes. Each pass saves the data it
5 produces to files on disk, so that a) we don't hold huge amounts of
6 state in memory, and b) the conversion process is resumable.
8 CollectRevsPass (formerly called pass1)
The goal of this pass is to write a summary of each CVS file as a
pickled CVSFile to 'cvs2svn-cvs-files.db', and a summary of each CVS
file revision as a pickled CVSRevision to 'cvs2svn-cvs-revs.db'. In
each case, items are assigned an arbitrary key that is used to refer
to them.
17 We walk over the repository, collecting data about the RCS files into
18 an instance of CollectData. Each RCS file is processed with
19 rcsparse.parse(), which invokes callbacks from an instance of
20 cvs2svn's _FileDataCollector class (which is a subclass of
23 For each RCS file, the first thing the parser encounters is the
24 administrative header, including the head revision, the principal
25 branch, symbolic names, RCS comments, etc. The main thing that
26 happens here is that _FileDataCollector.define_tag() is invoked on
27 each symbolic name and its attached revision, so all the tags and
28 branches of this file get collected. When this stage is done, the
29 parser invokes admin_completed(), which writes the CVSFile to the
32 Next, the parser hits the revision summary section. That's the part
33 of the RCS file that looks like this:
36 date 2002.06.12.04.54.12; author captnmark; state Exp;
42 date 2002.05.28.18.02.11; author captnmark; state Exp;
48 For each revision summary, _FileDataCollector.define_revision() is
49 invoked, recording that revision's metadata in various variables of
50 the _FileDataCollector class instance.
52 After finishing the revision summaries, the parser invokes
53 _FileDataCollector.tree_completed(), which loops over the revision
54 information stored, determining if there are instances where a higher
55 revision was committed "before" a lower one (rare, but it can happen
56 when there was clock skew on the repository machine). If there are
57 any, it "resyncs" the timestamp of the earlier rev to be just before
58 that of the later rev, but saves the original timestamp in
59 self._rev_data[blah].original_timestamp, so we can later write out a
60 record to the resync file indicating that an adjustment was made (this
61 makes it possible to catch the other parts of this commit and resync
62 them similarly; more details below).
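The resync step described above can be sketched roughly as follows.
This is only an illustration, not cvs2svn's actual code: RevData and
resync_timestamps are hypothetical names, timestamps are assumed to be
integer seconds, and the revisions are assumed to arrive ordered from
lowest to highest along one line of development.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RevData:
    """Hypothetical stand-in for the per-revision data kept by the collector."""
    rev: str
    timestamp: int
    original_timestamp: Optional[int] = None  # set only when resynced


def resync_timestamps(revs: List[RevData]) -> List[RevData]:
    """Make timestamps strictly increasing from lower to higher revisions.

    Walking backwards, any revision that is not older than its successor
    is moved to one second before that successor, and its original
    timestamp is saved. Returns the revisions that were adjusted.
    """
    adjusted = []
    for i in range(len(revs) - 2, -1, -1):
        earlier, later = revs[i], revs[i + 1]
        if earlier.timestamp >= later.timestamp:
            earlier.original_timestamp = earlier.timestamp
            earlier.timestamp = later.timestamp - 1
            adjusted.append(earlier)
    return adjusted
```

Each adjusted revision then carries both timestamps, which is what is
needed to write a resync record for it.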
64 Next, the parser encounters the *real* revision data, which has the
65 log messages and file contents. For each revision, it invokes
66 _FileDataCollector.set_revision_info(), which writes a record to
67 'cvs2svn-cvs-revs.db'.
69 Also, for resync'd revisions, a line like this is written out to
72 3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328
76 NEW_TIMESTAMP DIGEST OLD_TIMESTAMP
78 (The resync file will be explained later.)
80 That's it -- the RCS file is done.
82 When every CVS file is done, CollectRevsPass is complete, and:
84 - 'cvs2svn-cvs-files.db' contains a record of every CVS file.
86 - 'cvs2svn-cvs-revs.db' contains a summary of every revision to
87 every CVS file, including a reference to the corresponding CVS
88 file record in 'cvs2svn-cvs-files.db'. The order of the
89 revisions is arbitrary. In other words, a multi-file commit will
90 be scattered all over the place.
92 - 'cvs2svn-a-revs.txt' contains a list of CVSRevision keys that are
93 in 'cvs2svn-cvs-revs.db', in the order that they were written.
95 - 'cvs2svn-resync.txt' contains a small amount of resync data, in
98 - 'cvs2svn-branches.txt' ???
99 - 'cvs2svn-tags.txt' ???
100 - 'cvs2svn-metadata.db' ???
103 ResyncRevsPass (formerly called pass2)
106 This is where the resync file is used. The goal of this pass is to
107 output the information from cvs2svn-cvs-revs.db to a new file,
108 'cvs2svn-cvs-revs-resync.db' (clean revs). It has the same content as
109 the original file, except for some resync'd timestamps.
111 First, read the whole resync file into a hash table that maps each
112 author+log digest to a list of lists. Each sublist represents one of
113 the timestamp adjustments from CollectRevsPass, and looks like this:
115 [old_time_lower, old_time_upper, new_time]
117 The reason to map each digest to a list of sublists, instead of to one
118 list, is that sometimes you'll get the same digest for unrelated
119 commits (for example, the same author commits many times using the
120 empty log message, or a log message that just says "Doc tweaks."). So
121 each digest may need to "fan out" to cover multiple commits, but
122 without accidentally unifying those commits.
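A minimal sketch of building that hash table from resync lines of the
form "NEW_TIMESTAMP DIGEST OLD_TIMESTAMP" (hexadecimal timestamps).
The function name read_resync, the COMMIT_THRESHOLD value, and the way
the lower and upper bounds are derived from the old timestamp are all
assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed value in seconds; the real threshold is defined by cvs2svn.
COMMIT_THRESHOLD = 5 * 60


def read_resync(lines: Iterable[str]) -> Dict[str, List[List[int]]]:
    """Map each digest to a list of [old_time_lower, old_time_upper, new_time]."""
    resync: Dict[str, List[List[int]]] = defaultdict(list)
    half = COMMIT_THRESHOLD // 2
    for line in lines:
        new_hex, digest, old_hex = line.split()
        new_time, old_time = int(new_hex, 16), int(old_hex, 16)
        # Assumption: the window is centred on the old timestamp.
        resync[digest].append([old_time - half, old_time + half, new_time])
    return resync
```

Because every adjustment gets its own sublist, two unrelated commits
that happen to share a digest stay separate entries under the same key.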
Now we loop over 'cvs2svn-cvs-revs.db', and for each record write a
line to 'cvs2svn-c-revs.txt'. Each line of this file looks like this:
128 3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 12ab
132 1. a fixed-width timestamp
133 2. a digest of the log message + author
134 3. the integer unique ID for this CVSRevision, as a hexadecimal
137 Any CVSRevision records in 'cvs2svn-cvs-revs.db' whose digest matches
138 some resync entry and appear to be part of the same commit as one of
139 the sublists in that entry, get tweaked. The tweak is to adjust the
140 commit time of the line to the new_time, which is taken from the
141 resync hash and results from the adjustment described in
The way we figure out whether a given line needs to be tweaked is to
loop over all the sublists, seeing if this commit's original time
falls within the old<-->new time range for the current sublist. If it
does, we tweak the line before writing it out, and then conditionally
adjust the sublist's range to account for the timestamp we just
adjusted (since it could be an outlier). Note that this could, in
theory, result in separate commits being accidentally unified, since
we might gradually adjust the two sides of the range such that they are
eventually more than COMMIT_THRESHOLD seconds apart. However, this is
really a case of CVS not recording enough information to disambiguate
the commits; we'd know we have a time range that exceeds the
COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
up. We could try some clever heuristic, but for now it's not
important -- after all, we're talking about commits that weren't
important enough to have a distinctive log message anyway, so does it
really matter if a couple of them accidentally get unified? Probably
not.
162 'cvs2svn-tags.db' ??? is also created during this pass, for
163 undocumented purposes.
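The range check and range adjustment described above for this pass
might look something like the sketch below. The function name
resynced_time, the COMMIT_THRESHOLD value, and the exact widening rule
are assumptions; the resync table is assumed to map a digest to a list
of [old_time_lower, old_time_upper, new_time] sublists.

```python
from typing import Dict, List

# Assumed value in seconds; the real threshold is defined by cvs2svn.
COMMIT_THRESHOLD = 5 * 60


def resynced_time(resync: Dict[str, List[List[int]]],
                  digest: str, timestamp: int) -> int:
    """Return the timestamp to write out for one revision.

    If the revision's original time falls inside one of the windows
    recorded for its digest, use that window's new_time and widen the
    window so that later outliers of the same commit still match.
    """
    half = COMMIT_THRESHOLD // 2
    for record in resync.get(digest, []):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            # Assumed widening rule: keep half a threshold of slack
            # around the revision we just matched.
            record[0] = min(lower, timestamp - half)
            record[1] = max(upper, timestamp + half)
            return new_time
    return timestamp
```

The widening is what allows a window to grow beyond COMMIT_THRESHOLD
over time, which is the source of the accidental unification mentioned
above.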
166 SortRevsPass (formerly called pass3)
169 This is where we deduce the changesets, that is, the grouping of file
170 changes into single commits.
172 It's very simple -- run 'sort' on 'cvs2svn-c-revs.txt', converting it
173 to 'cvs2svn-s-revs.txt'. Because of the way the data is laid out,
174 this causes commits with the same digest (that is, the same author and
175 log message) to be grouped together. Poof! We now have the CVS
176 changes grouped by logical commit.
178 In some cases, the changes in a given commit may be interleaved with
179 other commits that went on at the same time, because the sort gives
180 precedence to date before log digest. However, CreateDatabasesPass
181 detects this by seeing that the log digest is different, and
182 reseparates the commits.
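As a rough illustration of sorting and then reseparating interleaved
commits, the sketch below sorts "TIMESTAMP DIGEST ID" lines and groups
revision IDs by digest. The function name group_commits and the rule
that a gap larger than COMMIT_THRESHOLD starts a new commit are
assumptions, not a description of what CreateDatabasesPass actually
does.

```python
from typing import Dict, Iterable, List

# Assumed value in seconds; the real threshold is defined by cvs2svn.
COMMIT_THRESHOLD = 5 * 60


def group_commits(lines: Iterable[str]) -> List[List[str]]:
    """Group revision IDs into commits, in order of first appearance."""
    groups: List[List[str]] = []
    open_group: Dict[str, int] = {}  # digest -> index into groups
    last_time: Dict[str, int] = {}   # digest -> last timestamp seen
    # Fixed-width hex timestamps sort chronologically as plain text.
    for line in sorted(lines):
        time_hex, digest, rev_id = line.split()
        timestamp = int(time_hex, 16)
        index = open_group.get(digest)
        if index is None or timestamp - last_time[digest] > COMMIT_THRESHOLD:
            index = len(groups)
            groups.append([])
            open_group[digest] = index
        groups[index].append(rev_id)
        last_time[digest] = timestamp
    return groups
```

Two commits that overlap in time end up in separate groups because
their digests differ, even though their lines alternate in the sorted
file.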
CreateDatabasesPass (formerly called pass4)
188 Find and create a database containing the last CVS revision that is a
189 source (also referred to as an "opening" revision) for all symbolic
190 names. This will result in a database containing key-value pairs
191 whose key is the id for a CVSRevision, and whose value is a list of
192 symbolic names for which that CVSRevision is the last "opening."
194 The format for this file is:
196 'cvs2svn-symbol-last-cvs-revs.db':
198 CVS Revision ID array of Symbolic names
202 5c --> ['TAG11', 'BRANCH38']
204 4d --> ['BRANCH48', 'BRANCH37']
205 f --> ['TAG320', 'TAG1178']
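One way to compute such a mapping, sketched under the assumption that
we can iterate over (CVS revision ID, opened symbols) pairs in commit
order. The function name last_openings and the input shape are
hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def last_openings(revs: Iterable[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Map each CVS revision ID to the symbols for which it is the last opening."""
    last_rev_for_symbol: Dict[str, str] = {}
    for rev_id, symbols in revs:
        for symbol in symbols:
            # A later revision that opens the same symbol overrides earlier ones.
            last_rev_for_symbol[symbol] = rev_id

    # Invert symbol -> revision into revision -> [symbols].
    result: Dict[str, List[str]] = defaultdict(list)
    for symbol, rev_id in last_rev_for_symbol.items():
        result[rev_id].append(symbol)
    return dict(result)
```

Revisions that are not the last opening for any symbol simply do not
appear as keys.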
208 AggregateRevsPass (formerly called pass5)
Primarily, this pass gathers CVS revisions into Subversion revisions
(a Subversion revision consists of one or more CVS revisions)
before we actually begin committing (where "committing" means either
to a Subversion repository or to a dump file).
216 This pass does the following:
218 1. Creates a database file to map Subversion Revision numbers to their
219 corresponding CVS Revisions ('cvs2svn-svn-revnums-to-cvs-revs.db').
220 Creates another database file to map CVS Revisions to their
221 Subversion Revision numbers ('cvs2svn-cvs-revs-to-svn-revnums.db').
2. When a file is copied to a symbolic name in cvs2svn, there is a
range of valid Subversion revisions that we can copy the file from.
The first valid Subversion revision number for a symbolic name is
called the "Opening", and the first *invalid* Subversion revision
number encountered after the "Opening" is called the "Closing". In
this pass, the SymbolingsLogger class writes one line to
'cvs2svn-symbolic-names.txt' per CVS file, per symbolic name, per
232 3. For each CVS Revision in s-revs, we write out a line (for each
233 symbolic name that it opens) to cvs2svn-symbolic-names.txt if it is
234 the first possible source revision (the "opening" revision) for a
235 copy to create a branch or tag, or if it is the last possible
236 revision (the "closing" revision) for a copy to create a branch or
237 tag. Not every opening will have a corresponding closing.
239 The format of each line is:
241 SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_FILE_ID
246 MY_BRANCH3 245 O * 1a9
247 MY_TAG1 241 C MY_BRANCH1 1a7
248 MY_BRANCH_BLAH 201 O MY_BRANCH1 1b3
250 Here is what the columns mean:
252 SYMBOLIC_NAME: The name of the branch or tag that starts or ends
253 in this CVS Revision (there can be multiples per
256 SVN_REVNUM: The Subversion revision number that is the opening or
257 closing for this SYMBOLIC_NAME.
259 TYPE: "O" for Openings and "C" for Closings.
261 BRANCH_NAME: The (uncleaned) branch name where this opening or
262 closing happened. '*' denotes the default branch.
264 CVS_FILE_ID: The ID of the CVS file where this opening or closing
265 happened, in hexadecimal.
267 See SymbolingsLogger for more details.
269 4. Creates a file 'cvs2svn-symbolic-names-closings-tmp.txt' ??? for
270 undocumented purposes.
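The line format from item 3 above can be parsed with something like
the following sketch. SymbolingLine and parse_symboling are
hypothetical names, and SVN_REVNUM is assumed to be written in decimal
(as the example values suggest), while CVS_FILE_ID is hexadecimal as
stated.

```python
from typing import NamedTuple, Optional


class SymbolingLine(NamedTuple):
    symbolic_name: str
    svn_revnum: int
    is_opening: bool            # True for "O", False for "C"
    branch_name: Optional[str]  # None stands for the default branch ('*')
    cvs_file_id: int


def parse_symboling(line: str) -> SymbolingLine:
    """Parse 'SYMBOLIC_NAME SVN_REVNUM TYPE BRANCH_NAME CVS_FILE_ID'."""
    name, revnum, kind, branch, file_id = line.split()
    if kind not in ("O", "C"):
        raise ValueError("unknown TYPE %r" % kind)
    return SymbolingLine(
        symbolic_name=name,
        svn_revnum=int(revnum),
        is_opening=(kind == "O"),
        branch_name=None if branch == "*" else branch,
        cvs_file_id=int(file_id, 16),
    )
```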
273 SortSymbolsPass (formerly called pass6)
276 This pass merely sorts 'cvs2svn-symbolic-names.txt' into
277 'cvs2svn-symbolic-names-s.txt'. This orders the file first by
278 symbolic name, and second by Subversion revision number, thus grouping
279 all openings and closings for each symbolic name together.
282 IndexSymbolsPass (formerly called pass7)
This pass iterates through all the lines in
'cvs2svn-symbolic-names-s.txt', writing out a database file
('cvs2svn-symbolic-name-offsets.db') mapping SYMBOLIC_NAME to the file
offset in 'cvs2svn-symbolic-names-s.txt' where SYMBOLIC_NAME is first
encountered. This will allow us to seek to the various offsets in the
file and sequentially read only the openings and closings that we
need.
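The offset index can be sketched as below. The function names are
hypothetical, a plain dictionary stands in for the database file, and
the sorted file is assumed to be opened in binary mode so that tell()
and seek() deal in plain byte offsets.

```python
import io
from typing import Dict, List


def index_symbol_offsets(f: io.BufferedIOBase) -> Dict[str, int]:
    """Map each symbolic name to the offset of its first line in f."""
    offsets: Dict[str, int] = {}
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        name = line.split(b" ", 1)[0].decode("utf-8")
        offsets.setdefault(name, offset)  # keep only the first occurrence
    return offsets


def read_symbol_lines(f: io.BufferedIOBase, offsets: Dict[str, int],
                      name: str) -> List[str]:
    """Seek to a symbol's first line and read until the name changes."""
    f.seek(offsets[name])
    lines: List[str] = []
    for raw in iter(f.readline, b""):
        line = raw.decode("utf-8").rstrip("\n")
        if line.split(" ", 1)[0] != name:
            break
        lines.append(line)
    return lines
```

This only works because the previous pass sorted the file, so all
lines for one symbolic name are contiguous.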
294 OutputPass (formerly called pass8)
This pass has very little "thinking" to do--it basically opens
'cvs2svn-svn-revnums-to-cvs-revs.db' and, starting with Subversion
revision 2 (revision 1 creates /trunk, /tags, and /branches),
sequentially plays out all the commits to either a Subversion
repository or to a dumpfile.
303 In --dump-only mode, the result of this pass is a Subversion
304 repository dumpfile (suitable for input to 'svnadmin load'). The
305 dumpfile is the data's last static stage: last chance to check over
306 the data, run it through svndumpfilter, move the dumpfile to another
309 However, when not in --dump-only mode, no full dumpfile is created for
310 subsequent load into a Subversion repository. Instead, miniature
311 dumpfiles representing a single revision are created, loaded into the
312 repository, and then removed.
In both modes, the dumpfile revisions are created by walking through
'cvs2svn-s-revs.txt'.
317 The databases 'cvs2svn-svn-nodes.db' and 'cvs2svn-svn-revisions.db'
318 form a skeletal (metadata only, no content) mirror of the repository
319 structure that cvs2svn is creating. They provide data about previous
320 revisions that cvs2svn requires while constructing the dumpstream.
323 ===============================
324 Branches and Tags Plan.
325 ===============================
This pass is also where tag and branch creation is done. Since
Subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation. For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
336 Also, since CVS does not version symbolic names, there is the
337 secondary question of *when* to create a particular tag or branch.
338 For example, a tag might have been made at any time after the youngest
339 commit included in it, or might even have been made piecemeal; and the
340 same is true for a branch, with the added constraint that for any
341 particular file, the branch must have been created before the first
342 commit on the branch.
344 Answering the second question first: cvs2svn creates tags as soon as
345 possible and branches as late as possible.
347 Tags are created as soon as cvs2svn encounters the last CVS Revision
348 that is a source for that tag. The whole tag is created in one
351 For branches, this is "just in time" creation -- the moment it sees
352 the first commit on a branch, it snaps the entire branch into
353 existence (or as much of it as possible), and then outputs the branch
The reason we say "as much of it as possible" is that it's possible to
have a branch where some files have branch commits occurring earlier
than the other files even have the source revisions from which the
branch sprouts (this can happen if the branch was created piecemeal,
for example). In this case, we create as much of the branch as we
can, that is, as much of it as there are source revisions available to
copy, and leave the rest for later. "Later" might mean just until
other branch commits come in, or else during a cleanup stage that
happens at the end of this pass (about which more later).
366 How just-in-time branch creation works:
368 In order to make the "best" set of copies/deletes when creating a
369 branch, cvs2svn keeps track of two sets of trees while it's making
1. A skeleton mirror of the Subversion repository, that is, an
array of revisions, with a tree hanging off each revision. (The
"array" is actually implemented as an anydbm database itself,
mapping string representations of numbers to root keys.)
377 2. A tree for each CVS symbolic name, and the svn file/directory
378 revisions from which various parts of that tree could be copied.
380 Both tree sets live in anydbm databases, using the same basic schema:
381 unique keys map to marshal.dumps() representations of dictionaries,
382 which in turn map entry names to other unique keys:
384 root_key ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
385 entrykey1 ==> { entrynameX : entrykeyX, ... }
386 entrykey2 ==> { entrynameY : entrykeyY, ... }
387 entrykeyX ==> { etc, etc ...}
388 entrykeyY ==> { etc, etc ...}
390 (The leaf nodes -- files -- are also dictionaries, for simplicity.)
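A small sketch of this schema, to make the key/dictionary layout
concrete. TreeDB and its methods are hypothetical names, and a plain
Python dictionary stands in for the anydbm database; only the use of
marshal.dumps() for the values is taken from the description above.

```python
import marshal
from typing import Dict, Optional


class TreeDB:
    """Nodes are stored as key -> marshal.dumps({entry name: entry key})."""

    def __init__(self) -> None:
        self.db: Dict[bytes, bytes] = {}  # stand-in for the anydbm file
        self._next_key = 0

    def new_node(self) -> str:
        """Create an empty node (directory or file) and return its key."""
        key = "%x" % self._next_key
        self._next_key += 1
        self.write(key, {})
        return key

    def write(self, key: str, entries: Dict[str, str]) -> None:
        self.db[key.encode("ascii")] = marshal.dumps(entries)

    def read(self, key: str) -> Dict[str, str]:
        return marshal.loads(self.db[key.encode("ascii")])

    def add_path(self, root_key: str, path: str) -> str:
        """Create any missing nodes along path; return the final node's key."""
        key = root_key
        for name in path.split("/"):
            entries = self.read(key)
            if name not in entries:
                entries[name] = self.new_node()
                self.write(key, entries)
            key = entries[name]
        return key

    def lookup(self, root_key: str, path: str) -> Optional[str]:
        """Return the key of the node at path, or None if it does not exist."""
        key: Optional[str] = root_key
        for name in path.split("/"):
            key = self.read(key).get(name)
            if key is None:
                return None
        return key
```

Answering "does this path exist under this root?" is then a matter of
following one dictionary per path component.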
392 The repository mirror allows cvs2svn to remember what paths exist in
For details on how branches and tags are created, please see the
docstring of the SymbolingsLogger class (and its methods).
398 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
399 - -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -
400 -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-
402 Some older notes and ideas about cvs2svn. Not deleted, because they
403 may contain suggestions for future improvements in design.
405 -----------------------------------------------------------------------
407 An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
408 considerations for the tool.
411 From: John Gardiner Myers <jgmyers@speakeasy.net>
412 Subject: Thoughts on CVS to SVN conversion
414 Date: Sun, 15 Apr 2001 17:47:10 -0700
416 Some things you may want to consider for a CVS to SVN conversion utility:
418 If converting a CVS repository to SVN takes days, it would be good for
419 the conversion utility to keep its progress state on disk. If the
420 conversion fails halfway through due to a network outage or power
421 failure, that would allow the conversion to be resumed where it left off
422 instead of having to start over from an empty SVN repository.
424 It is a short step from there to allowing periodic updates of a
425 read-only SVN repository from a read/write CVS repository. This allows
426 the more relaxed conversion procedure:
428 1) Create SVN repository writable only by the conversion tool.
429 2) Update SVN repository from CVS repository.
430 3) Announce the time of CVS to SVN cutover.
431 4) Repeat step (2) as needed.
432 5) Disable commits to CVS repository, making it read-only.
434 7) Enable commits to SVN repository.
435 8) Wait for developers to move their workspaces to SVN.
436 9) Decomission the CVS repository.
438 You may forward this message or parts of it as you seem fit.
441 -----------------------------------------------------------------------
443 Further design thoughts from Greg Stein <gstein@lyra.org>
445 * timestamp the beginning of the process. ignore any commits that
446 occur after that timestamp; otherwise, you could miss portions of a
447 commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
448 revision for items in B; we missed A)
450 * the above timestamp can also be used for John's "grab any updates
451 that were missed in the previous pass."
453 * for each file processed, watch out for simultaneous commits. this
454 may cause a problem during the reading/scanning/parsing of the file,
455 or the parse succeeds but the results are garbaged. this could be
456 fixed with a CVS lock, but I'd prefer read-only access.
458 algorithm: get the mtime before opening the file. if an error occurs
459 during reading, and the mtime has changed, then restart the file. if
460 the read is successful, but the mtime changed, then restart the
463 * use a separate log to track unique branches and non-branched forks
464 of revision history (Q: is it possible to create, say, 1.4.1.3
465 without a "real" branch?). this log can then be used to create a
466 /branches/ directory in the SVN repository.
Note: we want to determine some way to coalesce branches across
files. It can't be based on name, though, since the same branch name
could be used in multiple places, yet they are semantically
different branches. Given files R, S, and T with branch B, we can
tie those files' branch B into a "semantic group" whenever we see
commit groups on a branch touching multiple files. Files that
have a (named) branch but no commits on it are simply ignored. For
each "semantic group" of a branch, we'd create a branch based on
their common ancestor, then make the changes on the children as
necessary. For single-file commits to a branch, we could use
heuristics (pathname analysis) to add these to a group (and log what
we did), or we could put them in a "reject" kind of file for a human
to tell us what to do (the human would edit a config file of some
kind to instruct the converter).
483 * if we have access to the CVSROOT/history, then we could process tags
484 properly. otherwise, we can only use heuristics or configuration
485 info to group up tags (branches can use commits; there are no
486 commits associated with tags)
488 * ideally, we store every bit of data from the ,v files to enable a
489 complete restoration of the CVS repository. this could be done by
490 storing properties with CVS revision numbers and stuff (i.e. all
491 metadata not already embodied by SVN would go into properties)
493 * how do we track the "states"? I presume "dead" is simply deleting
494 the entry from SVN. what are the other legal states, and do we need
495 to do anything with them?
497 * where do we put the "description"? how about locks, access list,
500 * note that using something like the SourceForge repository will be an
501 ideal test case. people *move* their repositories there, which means
502 that all kinds of stuff can be found in those repositories, from
503 wherever people used to run them, and under whatever development
504 policies may have been used.
506 For example: I found one of the projects with a "permissions 644;"
507 line in the "gnuplot" repository. Most RCS releases issue warnings
508 about that (although they properly handle/skip the lines), and CVS
509 ignores RCS newphrases altogether.
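The mtime-based restart algorithm from the notes above (get the mtime
before opening the file, restart if it changed) could be sketched as
follows. The function name parse_stable, the parse callback, and the
max_attempts limit are all assumptions added for illustration; the
notes themselves do not say when to give up.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_stable(path: str, parse: Callable[[str], T],
                 max_attempts: int = 5) -> T:
    """Parse a file without locking it, restarting if it changes under us."""
    for _ in range(max_attempts):
        mtime_before = os.stat(path).st_mtime_ns
        try:
            result = parse(path)
        except Exception:
            if os.stat(path).st_mtime_ns != mtime_before:
                continue  # the file changed while we read it: restart
            raise         # the file did not change: a genuine error
        if os.stat(path).st_mtime_ns == mtime_before:
            return result
        # The parse succeeded but the file changed meanwhile: restart.
    raise RuntimeError("%s kept changing during %d attempts"
                       % (path, max_attempts))
```

This keeps access read-only, at the cost of possibly parsing a busy
file more than once.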