SCM Migration

From Apache OpenOffice Wiki
Revision as of 16:27, 12 April 2007 by Hr (Talk | contribs)

Glossary

The exact meaning of two terms is essential for the following migration guide:

  • project: a top-level project, with a project lead and a separate space on the OpenOffice.org web site and in the OpenOffice.org repository. Examples: gsl (the project that hosts the vcl code module, the resource compiler rsc, and 16 other code modules) and zh (the project that hosts the Chinese language community).
  • module: the next level of structure below a project. Code projects typically host several modules; language projects usually have only a www module. Attention: some modules share the name of their hosting project; for example, sw is also a module in the project sw.

Repository restructuring

Whether we adopt Subversion or a distributed SCM such as git or Mercurial as our new SCM tool, the necessary migration is also a good opportunity to restructure our repository and to do some badly needed cleanup.

This restructuring and migration guide is geared towards a migration to Subversion, but the same principles can and should be applied to a potential migration to another SCM tool.

Currently we have 136 top-level projects. Inside these projects we have varying numbers of modules, either web content modules or code modules. Many projects are dedicated to the OpenOffice.org language communities, which are essentially independent of each other. Modules from code projects, on the other hand, are highly dependent on each other. We have 250 of these code modules (some of them historical).

The idea is to move all modules containing OOo source code into a single repository. After that, each project gets its own repository, which is mostly for web content. After the migration, modules inside the new "code project" get linked into their original projects, to maintain the integrity of these projects.

Clean up

In 6 years we have accumulated a lot of cruft in the CVS repository. We take the opportunity to leave some dead ends out of the migration. The rule is that every released version of OOo must be represented in the new repository. Otherwise we are fairly free to define what we want to migrate and what not. Currently I plan to implement the following strategy:

  1. migrate all releases of OOo to the new SCM; this means release tags and branches must be preserved
  2. skip experimental branches and tags if they can be proven to be obsolete
  3. skip obviously dead parts of the repository
  4. skip tags and branches of all CWS with status integrated, finished, deleted or canceled as of a cutoff date (currently 2007/05/15)

The last rule reduces the number of branches to be migrated from about 5000 to about 500.
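
As a rough illustration of rule 4, the following Python sketch splits child workspaces into those to migrate and those to skip. It assumes a hypothetical mapping from CWS name to its status as of the cutoff date; how that snapshot is obtained is not covered here, and the function name split_cws is made up for this example.

```python
# Statuses named in rule 4; a CWS in one of these states is not migrated.
DONE_STATUSES = {"integrated", "finished", "deleted", "canceled"}


def split_cws(status_by_cws):
    """Split CWS names into (keep, skip) lists.

    status_by_cws is a hypothetical dict mapping a CWS name to its
    status as of the cutoff date.
    """
    keep, skip = [], []
    for name in sorted(status_by_cws):
        if status_by_cws[name].lower() in DONE_STATUSES:
            skip.append(name)   # tags and branches of this CWS are dropped
        else:
            keep.append(name)   # still active, so it must be migrated
    return keep, skip
```

The skip list is what would later feed the list of excluded symbolic names for the conversion.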

Recipe for migrating the code repository

For a migration to Subversion you need a fast Unix machine with cvs, subversion-1.4 and the cvs2svn Python script installed.

Copy and restructure the CVS code repository

Create a copy of the CVS repository. In the following, <work> is the directory which contains the 136 OOo top-level projects.

Compare the repository with the reference module list

I've prepared three files to help with the migration. The repository structure document repositorystructure.ods contains the new structure, the script repositorystructure.sh moves code modules into the new code project and removes obsolete stuff, and repositorystructure.txt.reference is the reference list of projects and modules that the first two files are based on.

The OOo repository is constantly growing. Before you start the migration you must check whether new projects or modules have been added to the CVS repository. Compare the directory with repositorystructure.txt.reference.

$ cd <work>
$ echo */* | sed -e 'y/ /\n/' | sort > repositorystructure.txt
$ diff repositorystructure.txt.reference repositorystructure.txt

If new projects and/or modules have been added, please add them to repositorystructure.sh and repositorystructure.ods according to the principles described above, and don't forget to update the reference list repositorystructure.txt.reference, too.
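
The same check can be sketched in Python, which makes it easier to see which entries are new and which have disappeared. This is only an illustration of the comparison, not one of the prepared files; the function names list_modules and compare_with_reference are made up, and the reference file is assumed to hold one project/module entry per line.

```python
import os


def list_modules(work):
    """Return sorted 'project/module' entries, roughly like `echo */*` in <work>."""
    entries = []
    for project in os.listdir(work):
        project_dir = os.path.join(work, project)
        # The shell glob skips hidden entries and only descends into directories.
        if project.startswith(".") or not os.path.isdir(project_dir):
            continue
        for module in os.listdir(project_dir):
            if not module.startswith("."):
                entries.append(project + "/" + module)
    return sorted(entries)


def compare_with_reference(reference, current):
    """Return (added, removed) entries relative to the reference list."""
    ref, cur = set(reference), set(current)
    return sorted(cur - ref), sorted(ref - cur)
```

Anything in the added list needs a place in the new structure before the restructuring script is run.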

Restructure repository

Use the repositorystructure.sh script to restructure the repository.

$ cd <work>
$ sh repositorystructure.sh

This script moves all modules with source code into the new code project but leaves the language projects alone. Additionally it removes some cruft from the module level of the repository, like nonsensical empty modules etc.

Repository clean up

The CVS repository contains a number of broken CVS archives, which fall into three categories:

  • Files which do not contain any revisions, but just the RCS header. These files are not valid RCS files and can safely be removed.
  • Files which are present in both <dir> and <dir/Attic>, for example hu/hu-po/crashrep.po,v and hu/hu-po/Attic/crashrep.po,v. Here a decision has to be made which version is the right one; the other one must be removed. We keep the one with the higher head revision number.
  • Files which have a tag/branch on a deleted revision. The tags/branches have to be removed via the rcs command.

The script cleanbrokenfiles.sh does the cleanup. It requires that the RCS command rcs can be found in $PATH.

$ cd <work>
$ sh cleanbrokenfiles.sh

It's quite likely that more files present in both <dir> and <dir/Attic> have turned up in the meantime. These can be found with the Python script finddouble.py (requires Python 2.5).

$ python finddouble.py

If this tool prints one or more lines, please add them to the cleanbrokenfiles.sh script and rerun the clean script. The output of finddouble.py is formatted in such a way that the line(s) can be added directly to the clean script.
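
To make the "higher head revision" rule concrete, here is a minimal Python 3 sketch. It is not finddouble.py or cleanbrokenfiles.sh; all function names are made up for illustration. It assumes the head revision can be read from the `head` field at the top of each *,v file, and it treats a file without a head revision as lower than any real revision. Revisions are compared as tuples of integers, so 1.10 correctly ranks above 1.9.

```python
import os
import re

# RCS header field "head <rev>;" (the revision may be missing).
HEAD_RE = re.compile(r"^head\s*((?:\d+\.)*\d+)?\s*;", re.MULTILINE)


def parse_head(rcs_header):
    """Return the head revision as a tuple of ints, or None if there is none."""
    match = HEAD_RE.search(rcs_header)
    if match is None or match.group(1) is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def read_head(path):
    """Read the head revision from the first bytes of an RCS file."""
    with open(path, "rb") as handle:
        return parse_head(handle.read(4096).decode("latin-1"))


def find_doubles(root):
    """Return (live, attic) path pairs for files present in both places."""
    doubles = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "Attic" or "Attic" not in dirnames:
            continue
        attic = os.path.join(dirpath, "Attic")
        attic_names = set(os.listdir(attic))
        for name in sorted(filenames):
            if name.endswith(",v") and name in attic_names:
                doubles.append((os.path.join(dirpath, name),
                                os.path.join(attic, name)))
    return doubles


def file_to_remove(live_path, attic_path):
    """Pick the copy with the lower head revision (the Attic copy on a tie)."""
    live, attic = read_head(live_path), read_head(attic_path)
    if (attic or ()) > (live or ()):
        return live_path
    return attic_path
```

On a tie this sketch removes the Attic copy; that tie-break is my assumption and should be checked by hand for any real case.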

Converting the repository

We convert the repository with the Python script cvs2svn-1.5.1. This script has quite a few dependencies: it needs a working Berkeley DB, the Berkeley DB Python bindings, the RCS tools in the path, subversion-1.4, and the Subversion Python bindings. cvs2svn is very flexible and detects quite a few CVS repository inconsistencies. It runs in 9 passes. Many problems are found during the first pass, which parses all the *,v CVS archives. If you encounter a problem in this pass, go back to the previous section, add the affected files to the cleanup script, and restart the conversion afterwards.

Converting the language projects

Please use the script convert.sh. The first section of the script contains a few paths; please adapt them to your needs. The script iterates over all 138 language projects and creates 138 Subversion repositories. If more language projects have been introduced in the meantime, add them to this script accordingly. The script assumes that the cvs2svn script is in your path. It will create Subversion repositories in the Berkeley DB format.

$ cd ..
$ sh convert.sh
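
The per-project loop can be pictured with the following Python sketch, which only builds the command lines and does not run anything. It is not the content of convert.sh; the function name and paths are hypothetical, and the --svnrepos and --fs-type options are given as I understand the cvs2svn command line, so please check them against `cvs2svn --help` for your version.

```python
import posixpath


def conversion_commands(cvs_root, svn_root, projects):
    """Build one cvs2svn command line per language project (dry run only)."""
    commands = []
    for project in projects:
        commands.append([
            "cvs2svn",
            "--fs-type=bdb",  # assumed option for a Berkeley DB repository
            "--svnrepos=" + posixpath.join(svn_root, project),
            posixpath.join(cvs_root, project),
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical paths and project names, printed for inspection.
    for command in conversion_commands("/work", "/svn", ["zh", "hu"]):
        print(" ".join(command))
```

Printing the commands first makes it easy to spot a missing project before starting a long conversion run.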

Converting the code modules

For converting the code modules we'll need the full flexibility of cvs2svn. cvs2svn can be customized via a so-called "options" file. This file contains Python instructions on how exactly the conversion should be done. We need customization for:

  • excluding old and/or obsolete tags and branch names (called symbolic names by cvs2svn)
  • forcing mixed tag/branch symbolic names to be either a tag or a branch
  • resolving the problem of "Blocked Exclusion", that is, symbolic names which are no longer needed but have other symbols depending on them, so they can't be excluded.
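
The "Blocked Exclusion" case from the list above can be illustrated with a small Python sketch. It does not use the cvs2svn API; it assumes a hypothetical mapping from each symbolic name to the symbols that depend on it, and reports which planned exclusions are blocked and by what.

```python
def find_blocked(excluded, dependents):
    """Map each blocked excluded symbol to the kept symbols that block it.

    excluded is the set of symbolic names we would like to drop;
    dependents is a hypothetical dict: symbol -> symbols depending on it.
    """
    excluded = set(excluded)
    blocked = {}
    for symbol in sorted(excluded):
        # A dependent that is itself excluded does not block anything.
        blockers = sorted(set(dependents.get(symbol, ())) - excluded)
        if blockers:
            blocked[symbol] = blockers
    return blocked
```

Each entry in the result needs a decision: either exclude the blocking symbols as well, or keep the blocked symbol after all.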

The customization file is named cvs2svn.options. Please adapt the first part of the options file to your needs (especially the paths). In its current form it will generate a Berkeley DB based Subversion repository. It can also be configured to create just a Subversion dump file.

$ cvs2svn --options=cvs2svn.options

Hint: I experienced problems with Python memory leaks which led to an out-of-memory condition. In this case the problem can be worked around by running the first eight passes and the final pass of cvs2svn separately.

$ cvs2svn -p 1:8 --options=cvs2svn.options
$ cvs2svn -p 9:9 --options=cvs2svn.options

It might be necessary to recreate parts of the cvs2svn.options customization file. This is the case if the list of excluded branches/tags has changed or new broken tags/branches (symbolic names) appear. To help with this there is the Python script options.py. It requires a file named cws_done.csv to be available in the run directory. cws_done.csv is a list of all child workspaces which are considered finished and will not be migrated to Subversion. Broken tags/branches are handled directly in options.py. Run the script with:

$ python options.py > options

and replace the section of cvs2svn.options that is marked with [option.py] with the content of options.
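
If you prefer not to paste by hand, the replacement can be sketched in Python as below. The exact form of the marker in cvs2svn.options is not described here, so this sketch assumes the section sits between a begin marker line and an end marker line, both passed in by the caller; the function name is made up.

```python
def splice_section(options_text, generated, begin_marker, end_marker):
    """Replace the lines between two marker lines with the generated text."""
    lines = options_text.splitlines(True)
    begin = end = None
    for index, line in enumerate(lines):
        if begin is None and begin_marker in line:
            begin = index
        elif begin is not None and end_marker in line:
            end = index
            break
    if begin is None or end is None:
        raise ValueError("marker lines not found")
    if generated and not generated.endswith("\n"):
        generated += "\n"
    # Keep both marker lines; swap only what lies between them.
    return "".join(lines[:begin + 1]) + generated + "".join(lines[end:])
```

Keeping the marker lines in place means the same step can be repeated whenever options.py is run again.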
