SCM Migration
Glossary
The exact meaning of two terms is essential for the following migration guide:
- project: a top-level project, with a project lead and a separate space on the OpenOffice.org web site and in the OpenOffice.org repository. Examples: gsl (the project which hosts the vcl code module, the resource compiler rsc and 16 other code modules) and zh (the project which hosts the Chinese language community).
- module: the next level of structure below a project. Code projects typically host several modules; language projects usually have only a www module. Note: some modules have the same name as their hosting project; for example, sw is also a module in the project sw.
Repository restructuring
Whether we adopt subversion as the new SCM tool or a distributed SCM like git, bazaar or mercurial, the necessary migration is also a good opportunity to restructure our repository and to do some badly needed cleanup.
This restructuring and migration guide is geared towards a migration to subversion, but the same principles can and should be applied to a potential migration to another SCM tool.
Currently we have 141 top level projects. Inside these projects we have varying numbers of modules, either web content modules or code modules. Many projects are dedicated to the OpenOffice.org language communities, which are essentially independent of each other. Modules from code projects, on the other hand, are highly dependent on each other. We have 260 of these code modules (some of them historical).
The idea is to move all modules containing OOo source code into a single repository. After that, each project gets its own repository, which is mostly for web content. After the migration, modules inside the new "code project" get linked into their original projects, to maintain the integrity of these projects.
Clean up
In 6 years we have accumulated a lot of cruft in the CVS repository. We take the opportunity to leave some dead ends out of the migration. The rule is that every released version of OOo must be represented in the new repository. Otherwise we are pretty free to define what we want to migrate and what not. Currently I plan to implement the following strategy:
- migrate all releases of OOo to the new SCM, this means release tags and branches must be preserved
- skip experimental branches and tags if they can be proven to be obsolete
- skip obvious dead parts of the repository
- skip tags and branches of all CWS with status integrated, finished, deleted or canceled at a certain date (currently the date is 2007/05/15)
The last rule reduces the number of branches to be migrated from about 5000 to about 500.
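The last rule can be pictured as a simple filter over a list of child workspaces. The following is only a sketch: the column names (name, status, status_date), the sample rows, and the reading of "at a certain date" as "status reached on or before the cutoff" are assumptions made for illustration, not the actual layout of any file used in the migration.

```python
import csv
import io
from datetime import date, datetime

# Statuses named in the rule above.
SKIP_STATUSES = {"integrated", "finished", "deleted", "canceled"}
# Cutoff date named in the rule above.
CUTOFF = date(2007, 5, 15)


def cws_to_skip(csv_text, cutoff=CUTOFF):
    """Return the names of CWS whose tags and branches would be skipped.

    Assumed (hypothetical) CSV columns: name, status, status_date.
    """
    skipped = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        status_date = datetime.strptime(row["status_date"], "%Y/%m/%d").date()
        if row["status"].lower() in SKIP_STATUSES and status_date <= cutoff:
            skipped.append(row["name"])
    return skipped


# Made-up sample data, for illustration only.
SAMPLE = """name,status,status_date
examplecws01,integrated,2007/03/01
examplecws02,new,2007/04/10
examplecws03,finished,2007/06/01
"""
```

With the sample above, only examplecws01 is selected: examplecws02 is still active, and examplecws03 was finished after the cutoff date.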
Recipe for migrating the code repository
For a migration to subversion a fast Unix machine with cvs, subversion-1.4 and the cvs2svn python script installed is needed.
Copy and restructure the CVS code repository
Create a copy of the CVS repository. In the following <work> is the directory which contains the 136 OOo top level projects.
Compare the repository with the reference module list
I've prepared three files to help with the migration. The repository structure document repositorystructure.ods contains the new structure, the script repositorystructure.sh moves code modules into the new code project and removes obsolete stuff, and repositorystructure.txt.reference is the reference module list used in the comparison below.
The OOo repository is constantly growing. Before you start the migration it's mandatory to check if new projects or modules have been added to the CVS repository. Compare the directory with repositorystructure.txt.reference.
$ cd <work>
$ echo */* | sed -e 'y/ /\n/' | sort > repositorystructure.txt
$ diff repositorystructure.txt.reference repositorystructure.txt
If new projects and/or modules have been added, please add them to repositorystructure.sh and repositorystructure.ods according to the above-mentioned principles, and don't forget to update the reference repositorystructure.txt.reference, too.
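For readers who prefer to script the comparison, the same check can be sketched in python. This is an illustrative equivalent of the shell pipeline above, not one of the prepared migration scripts; the function names are made up for this example, and it uses plain code-point sorting, which may order entries slightly differently from sort in some locales.

```python
import os


def list_modules(work_dir):
    """Return sorted 'project/module' entries, roughly like `echo */* | sort`."""
    entries = []
    for project in os.listdir(work_dir):
        project_path = os.path.join(work_dir, project)
        # The shell glob skips hidden entries and only descends into directories.
        if project.startswith(".") or not os.path.isdir(project_path):
            continue
        for module in os.listdir(project_path):
            if not module.startswith("."):
                entries.append(project + "/" + module)
    return sorted(entries)


def compare_with_reference(current, reference):
    """Return (added, removed) entries relative to the reference list."""
    current_set, reference_set = set(current), set(reference)
    added = sorted(current_set - reference_set)
    removed = sorted(reference_set - current_set)
    return added, removed
```

Anything reported as added is a project or module that has appeared since the reference list was written and still needs a place in the new structure.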
Restructure repository
Use the repositorystructure.sh script to restructure the repository.
$ cd <work>
$ sh repositorystructure.sh
This script moves all modules with source code into the new code project but leaves the language projects alone. Additionally it removes some cruft from the module level of the repository, like nonsensical empty modules etc.
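The core of such a restructuring step is a series of directory moves. The sketch below shows the idea under stated assumptions: the target project name "code" and the explicit list of project/module entries are hypothetical, and the real repositorystructure.sh may decide what to move in a different way.

```python
import os
import shutil


def move_modules(work_dir, code_project, modules):
    """Move the listed 'project/module' directories into code_project.

    Returns the names of the modules that were actually moved.
    """
    target_root = os.path.join(work_dir, code_project)
    os.makedirs(target_root, exist_ok=True)
    moved = []
    for entry in modules:
        project, module = entry.split("/", 1)
        source = os.path.join(work_dir, project, module)
        target = os.path.join(target_root, module)
        if not os.path.isdir(source):
            continue  # already moved, or not present in this copy
        if os.path.exists(target):
            # Two projects hosting a module of the same name need a manual decision.
            raise RuntimeError("module name clash: " + module)
        shutil.move(source, target)
        moved.append(module)
    return moved
```

Language projects are simply never listed, so their www modules stay where they are.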
Repository clean up
The CVS repository contains a number of broken CVS archives, which fall in three categories:
- Files which do not contain any revisions, but just the RCS header. These are not valid RCS files and can safely be removed.
- Files which are present in both <dir> and <dir/Attic>, for example hu/hu-po/crashrep.po,v and hu/hu-po/Attic/crashrep.po,v. Here a decision has to be made which version is the right one; the other one must be removed. We keep the one with the higher head revision number.
- Files which have a tag/branch on a deleted revision. The tags/branches have to be removed via the rcs command.
The script cleanbrokenfiles.sh does the cleanup. It requires that the RCS command rcs can be found in $PATH.
$ cd <work>
$ sh cleanbrokenfiles.sh
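The "higher head revision number" rule for files present in both <dir> and <dir/Attic> can be illustrated with a short sketch. It relies on the fact that an RCS file starts with a "head <revision>;" entry, and it compares revisions numerically so that 1.10 counts as newer than 1.9. The function names and the tie-breaking choice (keep the copy outside the Attic) are assumptions for illustration; this is not the logic of cleanbrokenfiles.sh itself.

```python
import re

# An RCS file begins with "head <revision>;"; the revision may be empty.
HEAD_RE = re.compile(rb"\s*head\s+([0-9.]*)\s*;")


def head_revision(rcs_bytes):
    """Return the head revision as a tuple of ints, or None if there is none."""
    match = HEAD_RE.match(rcs_bytes)
    if match is None or not match.group(1):
        return None
    return tuple(int(part) for part in match.group(1).decode("ascii").split("."))


def copy_to_keep(live_bytes, attic_bytes):
    """Return 'live' or 'attic', whichever has the higher head revision."""
    live = head_revision(live_bytes) or ()
    attic = head_revision(attic_bytes) or ()
    # Assumption: on a tie, keep the copy outside the Attic.
    return "attic" if attic > live else "live"
```

A file for which head_revision returns None is a candidate for the first category above (header only, no revisions).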
It's quite likely that there are more cases with files present in <dir> and <dir/Attic> in the meantime. These can be found with the python script finddouble.py (requires python 2.5).
$ python finddouble.py
If this tool prints one or more lines, please add them to the cleanbrokenfiles.sh script and rerun the clean script. The output of finddouble.py is formatted in a way that the line(s) can be directly added to the clean script.
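The search for such duplicates amounts to a directory walk. The following is a minimal sketch of the idea, not the actual finddouble.py: the function name is made up, and it returns path pairs rather than the clean-script lines that the real tool prints.

```python
import os


def find_doubles(root):
    """Return (live_path, attic_path) pairs for ,v files present in both places."""
    doubles = []
    for dirpath, dirnames, filenames in os.walk(root):
        if "Attic" not in dirnames:
            continue
        attic = os.path.join(dirpath, "Attic")
        for name in filenames:
            attic_path = os.path.join(attic, name)
            if name.endswith(",v") and os.path.isfile(attic_path):
                doubles.append((os.path.join(dirpath, name), attic_path))
    return sorted(doubles)
```

Each reported pair still needs the decision described above: keep the copy with the higher head revision and remove the other.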
Converting the repository
We convert the repository with the python script cvs2svn-1.5.1. This script has quite a few dependencies: it needs a working Berkeley DB, the Berkeley DB python bindings, the RCS tools in the path, subversion-1.4, and the subversion python bindings. cvs2svn is very flexible and detects quite a few CVS repository inconsistencies. It runs in 9 passes. Many problems will be found during the first pass, which parses all the *,v CVS archives. If you encounter a problem in this pass, go back to the previous section, add the affected files to the clean up script, and restart the conversion afterwards.
Converting the language projects
Please use the script convert.sh. The first section of the script contains a few paths; please adapt them to your needs. The script iterates over all 138 language projects and creates 138 subversion repositories. If more language projects have been introduced in the meantime, add them to this script accordingly. The script assumes that the cvs2svn script is in your path. It will create subversion repositories in the Berkeley DB format.
$ cd ..
$ sh convert.sh
Converting the code modules
For converting the code modules we'll need the full flexibility of cvs2svn. cvs2svn can be customized via a so-called "options" file. This file contains python instructions on how exactly the conversion should be done. We need customization for:
- excluding old and/or obsolete tags and branch names (called symbolic names by cvs2svn)
- forcing mixed tag/branch symbolic names to be either a tag or a branch
- resolving the problem of "Blocked Exclusion", that is, symbolic names which are no longer needed but which other symbols depend on, so they can't simply be excluded.
The customization file is named cvs2svn.options. Please adapt the first part of the options file to your needs (especially the paths). In its current form it will generate a Berkeley DB based subversion repository. It can be configured to create just a subversion dump file.
$ cvs2svn --options=cvs2svn.options
Hint: I experienced problems with python memory leaks which led to an out-of-memory condition. In this case the problem can be worked around by running the first eight passes and the final pass of cvs2svn separately.
$ cvs2svn -p 1:8 --options=cvs2svn.options
$ cvs2svn -p 9:9 --options=cvs2svn.options
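If the conversion is driven from a script, the same split can be expressed as two separate process invocations, so that the memory used by the first eight passes is released before the final pass starts. This is only a sketch built from the two commands above; the helper names are made up for illustration.

```python
import subprocess


def pass_commands(options_file, last_pass=9, split_after=8):
    """Build the two cvs2svn invocations shown above as argument lists."""
    option = "--options=" + options_file
    return [
        ["cvs2svn", "-p", "1:%d" % split_after, option],
        ["cvs2svn", "-p", "%d:%d" % (split_after + 1, last_pass), option],
    ]


def run_in_separate_processes(commands):
    """Run each command in its own process; stop at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Calling run_in_separate_processes(pass_commands("cvs2svn.options")) would run both steps in order, assuming cvs2svn is in the path.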
It might be necessary to recreate parts of the cvs2svn.options customization file. This is the case if the list of excluded branches/tags has changed or new broken tags/branches (symbolic names) appear. The python script options.py helps with this. It requires a file named cws_done.csv to be available in the run directory. cws_done.csv is a list of all child workspaces which are considered finished and will not be migrated to subversion. Broken tags/branches are handled directly in options.py. Run the script with:
$ python options.py > options
and replace the section of cvs2svn.options marked with [option.py] with the content of options.
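To give an idea of what such a generator does, here is a minimal sketch that turns a list of symbolic names into anchored, escaped regular expressions and renders them as python source. It is not the actual options.py: the variable name EXCLUDED_SYMBOLS is hypothetical, and how the patterns are wired into the cvs2svn symbol handling depends on the cvs2svn version and is deliberately not shown.

```python
import re


def exclusion_patterns(symbol_names):
    """Return one anchored regular expression per symbolic name."""
    # Escaping keeps characters such as "." from matching more than intended.
    return ["^%s$" % re.escape(name) for name in sorted(set(symbol_names))]


def render_section(patterns):
    """Render the patterns as python source lines for manual review."""
    lines = ["EXCLUDED_SYMBOLS = ["]  # hypothetical variable name
    lines += ["    r'%s'," % pattern for pattern in patterns]
    lines.append("]")
    return "\n".join(lines)
```

Anchoring each pattern matters here: an unanchored pattern for a short name would also exclude every longer symbolic name that contains it.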
Notes
The scripts mentioned above have been updated as of 2007/10/30, with the exception of cws_done.csv, which is as of March 2007. The migration of the code repository takes about 84:20h on a [x4200]. I tried to upgrade the conversion script from cvs2svn-1.5.1 to cvs2svn-2.0.1 but failed, because the script was so slow it never finished.
Access to test server
Subversion
The result of the conversion can be accessed with subversion via the URL svn+ssh://svn@o3-build.services.openoffice.org/svn
Example: See the latest change to the repository:
$ svn info svn+ssh://svn@o3-build.services.openoffice.org/svn
You'll need an ssh key; send me (hr) your public key if you plan to take part in the testing.
A read-only service without authentication is available via the URLs svn://o3-build.services.openoffice.org/svn and http://o3-build.services.openoffice.org/svn.
Example: list all tags in the repository:
$ svn list svn://o3-build.services.openoffice.org/svn/tags
$ svn list http://o3-build.services.openoffice.org/svn/tags
bazaar
The o3-build server also hosts a flat import of OpenOffice 2.3.0 in a bazaar repository. Note that this repository has no history information at all so it is not comparable to the subversion repository above. As soon as we have a working import I'll replace it with a real repository.
The bazaar repository can be accessed via sftp and (read-only) via http. A smart server setup (bzr+ssh) will follow soon.
Example: lightweight checkout via sftp
$ bzr checkout --lightweight sftp://svn@o3-build.services.openoffice.org/srv/bzr/trunk my.lightweight.checkout
Example: branch via http
$ bzr branch http://o3-build.services.openoffice.org/~svn/bzr/trunk my.branch
Please note the differences in the access URLs.
git
The o3-build server also hosts an import of OpenOffice in a git repository with almost the same amount of history as the SVN repository above.
The git repository can be accessed via the git protocol and (read-only) via http.
Example: clone via the git protocol
$ git clone git://o3-build.services.openoffice.org/git/ooo.git
Example: clone via http
$ git clone http://o3-build.services.openoffice.org/~svn/ooo.git
Please note the differences in the access URLs.
Replicate test server
The repository can be replicated with the svnsync tool. No special server side setup is necessary (read-only access is sufficient), but you need to make certain that the target repository can't be modified by other means than svnsync.
- First create an empty target repository:
$ svnadmin create /absolute/path/to/rep
- Implement the pre-revprop-change and start-commit hooks
$ cat /absolute/path/to/rep/hooks/pre-revprop-change
#!/bin/sh
USER="$3"
if [ "$USER" = "syncuser" ]; then exit 0; fi
echo "Only the syncuser user may change revision properties" >&2
exit 1
$ cat /absolute/path/to/rep/hooks/start-commit
#!/bin/sh
USER="$2"
if [ "$USER" = "syncuser" ]; then exit 0; fi
echo "Only the syncuser user may commit new revisions" >&2
exit 1
- Initialize the target repository:
$ svnsync init file:///absolute/path/to/rep http://o3-build.services.openoffice.org/svn
- And finally synchronize the target repository with the source repository
$ svnsync synchronize file:///absolute/path/to/rep http://o3-build.services.openoffice.org/svn
The full details for replicating SVN repositories can be found [here].
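Hooks can be any executable, so the two shell hooks above could also be written as a single python script installed under both hook names. The sketch below assumes that arrangement; the user name syncuser and the argument positions are taken from the shell versions above, everything else (function names, sharing one file) is an illustrative choice.

```python
#!/usr/bin/env python
import os
import sys

SYNC_USER = "syncuser"

# Position of the user name in the hook arguments, as in the shell hooks above.
USER_ARG_INDEX = {"pre-revprop-change": 3, "start-commit": 2}
ACTIONS = {
    "pre-revprop-change": "change revision properties",
    "start-commit": "commit new revisions",
}


def user_from_args(hook_name, argv):
    """Return the user name passed to the given hook, or '' if it is missing."""
    index = USER_ARG_INDEX[hook_name]
    return argv[index] if len(argv) > index else ""


def check_user(user, action):
    """Return (exit_code, message); only the sync user is allowed."""
    if user == SYNC_USER:
        return 0, ""
    return 1, "Only the %s user may %s" % (SYNC_USER, action)


# Only act when the file is installed (copied or linked) under a hook name.
if __name__ == "__main__" and os.path.basename(sys.argv[0]) in USER_ARG_INDEX:
    hook = os.path.basename(sys.argv[0])
    code, message = check_user(user_from_args(hook, sys.argv), ACTIONS[hook])
    if message:
        sys.stderr.write(message + "\n")
    sys.exit(code)
```

As with the shell versions, the hook files must be executable for subversion to run them.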
Evaluating the SCM candidates, Metrics
Evaluating centralized vs. distributed SCM systems for their viability for hosting the OpenOffice.org source code repository isn't as straightforward as one might hope. This is because the workflows differ substantially between distributed and centralized SCM systems. The best approach seems to be to define typical workflows for each SCM, which will be evaluated against the test repositories above and thereby define our metrics. Since each developer group within the OpenOffice.org community has quite different needs, there will be no single workflow which fits all. I (hr) would like to ask each developer group within the OOo community to add their expected typical workflow with each SCM below.
Sun Hamburg RE
Subversion
Environment | checkout (2) | tag anchor (3) | tag branch (4) | switch (5) | diff (7) | commit (8) | rebase (9) | commit (10) | move tag (11) | switch (12) | integrate (13) |
Unix, local disk (warm) | 5m38s | ~1s | ~1s | 59s | 6s | 48s | 2m33s | 2m06s | ~1s | 39s | 2m30s |
Windows (cygwin), local disk | 35m25s | ~1s | ~1s | 5m03s | 35s | 3m48s | 4m36s | 4m50s | ~1s | 6m00s | 6m07s |
Unix, remote volume | 77m17s | ~1s | ~1s | 3.23s | 47s | 4m21s | 6m29s | 5m11s | ~1s | 5m35s | 8m69s |
Windows, remote volume | 157m14s |
Environment | local status over whole tree (warm) | local status over whole tree (cold) | log over single file | annotate over single file |
Unix (local disk) | 2s | 33s | ~1s | ~1s |
Windows (local disk) | 19s | na (don't know how to drop caches on Windows) | ~1s | ~1s |
CWS creation, workflow:
1. tag test milestone on basis of OOo_2_3_0 release: [not timed]
2. check out test milestone from o3-build: [time]
3. tag test milestone with anchor tag: [time]
4. tag test milestone with branch tag: [time]
5. switch to branch tag: [time]
6. make changes on test branch (~2157 files): [not timed]
7. diff changes: [time]
8. commit changes on branch: [time]
9. rebase branch to newer milestone with non-conflicting changes (another ~2157 files): [time]
10. commit changes on branch: [time]
11. move anchor tag to new milestone: [time]
12. switch to trunk: [time]
13. integrate (merge) branch into trunk: [time]
Misc. operations:
- status over whole tree, cold: [time]
- status over whole tree, warm: [time]
- log on single file: [time]
- annotate on single file: [time]
Git
Environment | clone remote (2) | clone local (3) | create branch (4) | switch (5) | diff (7) | commit (8) | pull to pristine (9) | pull to working (10) | rebase (11) | push (12) |
Unix, local disk (warm) | 24m13s | 1m58s | <1s | 3s | 3s | 8s | 49s | 6s | 1m04s | 9s |
Windows, local disk | 33m07s (some lock errors) | 14m43s | <1s | 13s | 10s | 25s | 1m16s | 30s | 1m51s | 15s |
Unix, remote volume | 41m98s | 13m15s | <1s | 54s | 35s | 36s | 1m15s | 1m10s | 56s | 16s |
Windows, remote volume | stops at 39% completion |
Environment | local status over whole tree (warm) | local status over whole tree (cold) | log over single file | annotate over single file |
Unix (local disk) | 2s | 37s | 20s | 8s |
Windows (local disk) | 18s | na | 25s | 23s |
CWS creation, workflow:
1. tag test milestone on basis of OOo_2_3_0 release: [not timed]
2. clone git repository from o3-build to pristine local copy: [time]
3. clone repository to local working copy: [time]
4. create new branch in working copy: [time]
5. switch to branch: [time]
6. make changes on test branch (~1000 files): [not timed]
7. diff changes: [time]
8. commit changes on branch: [time]
9. pull non-conflicting changes (another ~1000 files) from upstream into pristine copy: [time]
10. pull non-conflicting changes from pristine copy into local copy: [time]
11. rebase branch to newer milestone with non-conflicting changes: [time]
12. push changes to upstream: [time]
Misc. operations:
- status over whole tree, cold: [time]
- status over whole tree, warm: [time]
- log on single file: [time]
- annotate on single file: [time]
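To compare timings across the tables above, the durations first have to be converted to seconds. The sketch below parses values such as 5m38s, 59s, ~1s and <1s (the latter two are treated as 1 second, which for <1s is an upper bound) and returns None for entries that are not well-formed durations, such as 8m69s, 41m98s or 3.23s, so that they can be checked by hand rather than silently used. The function name and the handling of the ~ and < prefixes are assumptions for illustration.

```python
import re

# Optional "~" or "<" prefix, optional minutes, mandatory seconds.
DURATION_RE = re.compile(r"^[~<]?(?:(\d+)m)?(\d+)s$")


def parse_duration(text):
    """Return seconds for strings like '5m38s' or '~1s'; None if malformed."""
    match = DURATION_RE.match(text.strip())
    if match is None:
        return None
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    if match.group(1) is not None and seconds >= 60:
        return None  # e.g. '8m69s' cannot be a valid minutes/seconds value
    return minutes * 60 + seconds


# Example: subversion checkout vs. git remote clone on Unix, local disk (warm).
svn_checkout = parse_duration("5m38s")   # 338 seconds
git_clone = parse_duration("24m13s")     # 1453 seconds
ratio = git_clone / svn_checkout         # about 4.3
```

Note that a checkout and a clone are not equivalent operations (a clone transfers the history as well), so such a ratio only describes the measured step, not the tools as a whole.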
Preliminary Evaluation results
For the ESC steering committee meeting on February 18, 2008, we (Jan Holesovsky and Jens-Heiner Rechtien) prepared a paper about the current status of the SCM evaluation. Please find it here.
Evaluating DSCM candidates
The current version control system for OpenOffice.org is going to be replaced by a Distributed Software Configuration Management (DSCM) system. An evaluation has been prepared and was presented at the March 2009 ESC meeting.
DSCM System Preferences Survey
During the ESC meeting it was suggested to consult OpenOffice.org contributors about experiences and preferences with version control systems.
The survey started 2009-03-12 and closed after 2 weeks.
Participation in the Survey
- Click on the link below. You will be asked to enter your name and your @openoffice.org email address. The email address is mandatory and we will ignore any submissions from any other email addresses.
- An email will be sent to your @openoffice.org address containing a link. Clicking on the link will take you to the survey system. Note: if you do not receive the email, please check your spam filter!
- When you have made your choices, the system will send you a second email to confirm that your selections have been stored.
http://surveys.services.openoffice.org/surveys/index.php?sid=52123&lang=en
SCM System Preferences Survey Thanks
Thank you for participating in the survey. The results will be taken into consideration for the final decision.
DSCM System Preferences Survey Results
149 contributors participated in the survey. Bazaar was the system of choice for 3%, Git for 23% and Mercurial for 49%; 25% had no preference.