Introduction, or why is it so hard to migrate from CVS to a modern SCM

One thing that needs to be defined within our migration project is how much we need to convert to the next OpenOffice.org SCM system, which will be Subversion first and later most probably a DSCM.

All modern open source SCM systems I know of are change set oriented. This means that they track not individual files but change sets, each of which can consist of changes to a very large number of files. One benefit of change sets is obvious: changes in the structure of the tree (directories, renames etc.) can be recorded as well. Another very important benefit is consistency: changes which belong together are recorded together in one place.

Consider a CVS branch or tag operation over the whole source tree: all active repository files (those which are not in /Attic) are marked with the tag or branch label, which involves a rewrite of the complete active archive, some 12 GiB or so in the case of OpenOffice.org. This is not only incredibly slow but also quite unsafe, because many thousands of individual files are often involved in recording one simple thing like a branch or tag label. If something goes wrong during such an operation, the repository as a whole is left in a corrupt state.

All CVS repositories of notable size are corrupted in one way or another. This is usually not a problem and is rarely noticed in daily CVS usage, but it can become a problem when trying to recreate a historical state of the project, or when trying to import the project history into a new SCM system.

CVS best practices recommend never moving a tag or - even worse - a branch after it has been created, in order to keep the repository consistent. But moving tags and branches is at the heart of the CWS "resync" mechanism we employ for OpenOffice.org. We can therefore expect a certain amount of corruption within the OOo repository.

What needs to be done when migrating the project history from an "old style" CVS repository to a new change set based repository? First one needs to identify which change in which file belongs to which change set. The naive way is to represent each change in each file as an individual change set. This can lead to an incredibly bloated new repository, because there is a certain overhead per change set. Some change set based SCMs are better in this respect than others, but still, representing, say, an identical license change in 10000 files as 10000 individual change sets is going to be wasteful.

How does a conversion tool recognize whether individual changes in different files belong together? Well, they probably belong together if a) the revision comment is identical and b) the commit times are, well, within a certain time span of each other. Remember, CVS stores the history in individual repository files which know nothing about each other, so extracting "correct" change sets is going to be an imprecise science at best. Conversion tools employ quite a bit of heuristics for this. Oh, and just going by chronological order is not an option, because unrelated changes could have been committed at the same time: it is possible that a long commit of, say, 1000 files is interleaved with a commit of one file which is completely unrelated.
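The heuristic described above can be sketched roughly as follows. This is only an illustration, not how any particular conversion tool works: the names `FileRevision`, `ChangeSet` and `group_into_change_sets`, the 300 second window, and the extra rules that the author must match and that a file may appear only once per change set are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileRevision:
    """One revision of one file, as recorded in a single CVS repository file."""
    path: str
    revision: str
    author: str
    comment: str
    timestamp: int  # commit time in seconds


@dataclass
class ChangeSet:
    author: str
    comment: str
    revisions: list = field(default_factory=list)

    @property
    def last_timestamp(self):
        return self.revisions[-1].timestamp


def group_into_change_sets(revisions, window_seconds=300):
    """Guess change sets from per-file revisions.

    Two revisions are put together if author and comment are identical
    and the new revision is at most window_seconds later than the last
    revision already in the change set (assumed heuristic).
    """
    open_sets = {}  # (author, comment) -> most recent change set for that key
    result = []
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.comment)
        current = open_sets.get(key)
        fits = (
            current is not None
            and rev.timestamp - current.last_timestamp <= window_seconds
            # A file can only change once within one change set.
            and all(r.path != rev.path for r in current.revisions)
        )
        if not fits:
            current = ChangeSet(rev.author, rev.comment)
            open_sets[key] = current
            result.append(current)
        current.revisions.append(rev)
    return result


revs = [
    FileRevision("a.c", "1.2", "alice", "license update", 100),
    FileRevision("x.c", "1.5", "bob", "fix crash", 101),  # interleaved, unrelated
    FileRevision("b.c", "1.3", "alice", "license update", 102),
    FileRevision("c.c", "1.1", "alice", "license update", 1000),  # too late
]
sets = group_into_change_sets(revs)
```

Because change sets are keyed by author and comment rather than by position in time, the unrelated commit to `x.c` does not split the license update in two. The window is measured from the last revision already in the set, so a long-running commit can span more than `window_seconds` in total as long as there is no large gap inside it.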

Is it important that a conversion tool extracts "correct and minimal" change sets? Well, no, because in case of doubt it's always possible to represent a logical change which spans two or more files as two or more change sets, at the cost of some overhead. But there is an important constraint to be observed: CVS tags and branches always need to be represented correctly in the new repository. After all, these are used to reconstruct historical states of the project.
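To make the constraint a bit more concrete, here is a small sketch of how one might check whether a tag can be represented as a single point in the converted history. It is a simplification under stated assumptions: the function name `find_tag_point` is made up for this example, change sets are plain lists of `(path, revision)` pairs in repository order, and deleted files and partially tagged trees are ignored.

```python
def find_tag_point(change_sets, tag):
    """Return the index of the change set after which the tree matches the tag.

    change_sets: list of change sets, each a list of (path, revision) pairs.
    tag: dict mapping path -> tagged revision.
    Returns None if no point in the history matches the tag exactly.
    """
    tree = {}  # path -> revision after applying the change sets seen so far
    for index, change_set in enumerate(change_sets):
        for path, revision in change_set:
            tree[path] = revision
        if tree == tag:
            return index
    return None


tag = {"a.c": "1.2", "b.c": "1.1"}

# Both files changed in one change set: the tag cuts through it.
merged = [
    [("a.c", "1.1"), ("b.c", "1.1")],
    [("a.c", "1.2"), ("b.c", "1.2")],
]

# The same history with the second change set split in two.
split = [
    [("a.c", "1.1"), ("b.c", "1.1")],
    [("a.c", "1.2")],
    [("b.c", "1.2")],
]

merged_point = find_tag_point(merged, tag)
split_point = find_tag_point(split, tag)
```

In the `merged` history the tag has no matching point, while in the `split` history it does. This is the trade-off from the paragraph above: splitting a logical change into more change sets costs some overhead, but it can be what makes a tag representable at all.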

There are a number of conversion tools
