The Trunk - Subversion Conversion to Mercurial, Part 1

So you’ve got a bunch of code, the key to your company’s future, and like a good developer you’re keeping it under source control using Subversion. But you’ve heard about these new distributed version control systems, like Mercurial, and after doing some research you’ve decided to take the plunge. Now you face a challenge: how do you get all that juicy code into a Mercurial repository?

Never fear! The wonderful Mercurial developers have created the excellent convert extension, which can convert Subversion repositories to Mercurial. Because I work on the Kiln tool to import Mercurial repositories from existing code in a different source control system, I’ve been working to understand more about how the whole conversion process works. To get a good basis of understanding, let’s first look at how the extension will import a single line of development - the trunk. This is reasonable in cases where there are no branches or where all branches have already been merged into trunk, and you don’t care about which changes were made on which branches. I’ll explore the process of converting tags and branches in future posts.

A Generic Algorithm


self.ui.status(_("scanning source...\n"))
heads = self.source.getheads()
parents = self.walktree(heads)

The convert extension is designed to be a generic converter from many different repository types. The overall convert algorithm is handled at this generic level, while the details of retrieving specific revisions, files, and tags are handled by converter source objects that are specific to the source repository type. A destination converter object (in this case Mercurial-specific) does the work of writing the revisions, files, and tags to the new Mercurial repository.
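To make that split concrete, here is a minimal sketch of a generic driver talking to a source object and a destination object. The class names, the in-memory data layout, and the simplified method signatures are my own assumptions for illustration; they are not the extension's real classes, and the driver only handles a single line of history.

```python
class ListSource:
    """Hypothetical source object: serves revisions from an in-memory dict."""

    def __init__(self, revisions):
        # revisions maps rev id -> (list of parent ids, {path: contents})
        self.revisions = revisions

    def getheads(self):
        # A head is a revision that no other revision names as a parent.
        used = {p for parents, _ in self.revisions.values() for p in parents}
        return [rev for rev in self.revisions if rev not in used]

    def getcommit(self, rev):
        return self.revisions[rev]


class ListSink:
    """Hypothetical destination object: records commits as they arrive."""

    def __init__(self):
        self.commits = []

    def putcommit(self, rev, parents, files):
        self.commits.append((rev, parents, files))
        return "new-" + rev  # stand-in for the new changeset id


def convert(source, sink):
    """Generic driver: it never looks at repository-specific details."""
    order = []
    pending = list(source.getheads())
    while pending:  # walk back from the head to the root
        rev = pending.pop()
        if rev not in order:
            order.append(rev)
            pending.extend(source.getcommit(rev)[0])
    mapping = {}
    for rev in reversed(order):  # oldest first (enough for a linear trunk)
        parents, files = source.getcommit(rev)
        mapping[rev] = sink.putcommit(rev, [mapping[p] for p in parents], files)
    return mapping
```

Swapping in a different source or destination only means providing another object with the same few methods; the driver stays untouched.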

So you fire up your trusty shell, and kick off:
hg convert svn://path/to/your/svn/repository --datesort

After parsing the command line, the converter creates the source and destination objects. We didn’t specify a filemap, but if we had, the converter would then wrap the source object (here, the Subversion converter object) in a filemap source object. This filemap object uses the filemap file to adjust file paths before they are passed to or returned from the source repository object.

When you execute a typical hg convert command, the first output line you’ll see is this:
scanning source...

When this appears, the converter begins by asking the source object to get the heads, or the latest revisions, of that repository. Because we’re ignoring branches for now, the Subversion converter object will just get the latest revision under the trunk. After getting this revision, it walks backwards through the revisions until it reaches the beginning. As the converter retrieves each revision from the source converter object, it caches it and creates a map from the revision to a list of its parents. In the case of a single Subversion trunk, each revision will only have one parent.
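The backwards walk can be sketched in a few lines. This is a simplified illustration, not the extension's code: the getparents callback and the r1..r3 revision names are assumptions I made up for the example.

```python
def walktree(heads, getparents):
    """Walk back from the heads and return {rev: [parent revs]}."""
    parents = {}
    visit = list(heads)
    while visit:
        rev = visit.pop()
        if rev in parents:
            continue  # already visited this revision
        parents[rev] = list(getparents(rev))
        visit.extend(parents[rev])
    return parents


# Hypothetical trunk history: r3 -> r2 -> r1
history = {"r3": ["r2"], "r2": ["r1"], "r1": []}
parents = walktree(["r3"], history.__getitem__)
```

For a trunk-only history every list in the resulting map has at most one entry, and the root revision has none.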

Sorting the revisions


self.ui.status(_("sorting...\n"))
t = self.toposort(parents, sortmode)
num = len(t)

The converter needs to process the revisions in the right order. The hg convert command gives three sorting options: datesort, branchsort, and sourcesort. The sourcesort option is not available when converting from Subversion. To perform either of the other sorts, the converter first creates a children map from the parents map, as well as a list of the roots, or revisions without parents. For our trunk-only conversion there will only be one root revision. Starting at this root revision, the converter chooses the next revision based on the ordering type. Then it adds any revisions whose parents are all in the ordering to the list of possible next revisions, from which the next revision is chosen. For a Subversion trunk-only conversion, there will only ever be one revision to choose from, regardless of the sort order. Therefore, I’ll discuss the differences between datesort and branchsort in part 2, on converting branches.
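Here is a rough sketch of that ordering step, assuming a parents map of the shape {rev: [parent revs]}. The pick argument is a hypothetical stand-in for whatever datesort or branchsort would use to choose among the candidates; the real extension's selection logic is more involved.

```python
def toposort(parents, pick):
    """Order revisions so that every parent comes before its children."""
    children = {}
    roots = []
    for rev, revparents in parents.items():
        if not revparents:
            roots.append(rev)  # no parents: a root revision
        for parent in revparents:
            children.setdefault(parent, []).append(rev)

    ordered = []
    done = set()
    candidates = list(roots)
    while candidates:
        rev = pick(candidates)  # the sort mode decides which one is next
        candidates.remove(rev)
        ordered.append(rev)
        done.add(rev)
        # A child becomes a candidate once all of its parents are ordered.
        for child in children.get(rev, []):
            if all(p in done for p in parents[child]):
                candidates.append(child)
    return ordered


trunk = {"r1": [], "r2": ["r1"], "r3": ["r2"]}
order = toposort(trunk, min)
```

On a linear trunk the candidate list never holds more than one revision, so the choice of pick makes no difference to the result.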

Importing changes


def copy(self, rev):
    commit = self.commitcache[rev]
    files, copies = self.source.getchanges(rev)
    parents = [self.map[p] for p in commit.parents]
    newnode = self.dest.putcommit(files, copies, parents, commit,
                                  self.source, self.map)
    self.source.converted(rev, newnode)
    self.map[rev] = newnode

Now that we’ve got this sorted list of revisions, the converter can start the process of converting each one individually. It does this by retrieving the appropriate changes from Subversion and copying them to Mercurial.

When it initially walked the tree of changes, the Subversion converter object stored the paths of files in each revision as well as the parent revisions. Because we’re looking at a trunk-only conversion, each revision will only ever have one parent. As the conversion proceeds, each of these revisions has its paths expanded. Each path is checked to see if it is a file, a directory, or a deleted item. File paths are recoded appropriately. Paths representing directories are expanded to include all files in the directory at that revision, and records of copied files and directories are also stored.
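The directory expansion is the least obvious part, so here is a small sketch of the idea. The "file" / "dir" / "deleted" labels and the flat set of paths standing in for the tree at a revision are assumptions for the example; the real converter asks Subversion for this information.

```python
def expand_paths(changed, tree):
    """Turn the changed paths of one revision into a flat list of files.

    changed maps path -> "file", "dir", or "deleted" (hypothetical labels).
    tree is the set of file paths that exist at that revision.
    """
    files = []
    for path, kind in sorted(changed.items()):
        if kind == "dir":
            # A directory stands for every file beneath it at this revision.
            prefix = path.rstrip("/") + "/"
            files.extend(sorted(f for f in tree if f.startswith(prefix)))
        else:
            # Plain files and deleted items are passed through as they are.
            files.append(path)
    return files


changed = {"src": "dir", "README": "file", "old.txt": "deleted"}
tree = {"src/a.py", "src/lib/b.py", "README", "srcx/c.py"}
files = expand_paths(changed, tree)
```

Note that the prefix check includes the trailing slash, so a sibling directory such as srcx is not swept up by a change to src.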

The Mercurial converter object then goes through the files and copies and retrieves the contents of each file from the subversion converter object. It uses the file contents to create the revision to be committed to the destination repository. That’s it. Your conversion is all done! Or is it? What if someone makes more changes to the subversion repository after you already performed the conversion?

Multiple hg convert runs


# Record converted revisions persistently: maps source revision
# ID to target revision ID (both strings).  (This is how
# incremental conversions work.)
self.map = mapfile(ui, revmapfile)

The hg convert extension supports multiple executions against the same source and destination repositories. This can be useful if you did one run of hg convert, and then later wanted to pull in further development from your subversion repository. This feature is primarily made possible by the revmap, a file that hg convert saves in the destination’s .hg directory. The revmap is just a simple map from revision ids in the source repository to revision ids in the destination repository. The hg convert extension reads this revmap in (if it exists) before beginning conversion. It uses the revmap to determine which revisions have already been converted, and accordingly begins with revisions that come after those already converted. One option, when running hg convert, is to specify where the revmap is - or where to save it if this is the first run against a given repository.
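A minimal sketch of how a revmap makes incremental runs possible. I am assuming a plain-text file with one "source-id destination-id" pair per line; the function names and the sample ids are made up for the example.

```python
import os
import tempfile


def read_revmap(path):
    """Load {source rev id: destination rev id}; empty on the first run."""
    revmap = {}
    if not os.path.exists(path):
        return revmap
    with open(path) as fp:
        for line in fp:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last space: the destination id never contains one.
            src, dst = line.rsplit(" ", 1)
            revmap[src] = dst
    return revmap


def pending(ordered, revmap):
    """Keep only the revisions that have not been converted yet."""
    return [rev for rev in ordered if rev not in revmap]


# Pretend a previous run already converted r1 and r2.
path = os.path.join(tempfile.mkdtemp(), "revmap")
with open(path, "w") as fp:
    fp.write("r1 aaa111\nr2 bbb222\n")
todo = pending(["r1", "r2", "r3"], read_revmap(path))
```

Here todo contains only r3, which is why a second run can pick up where the first one stopped.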

Another trick to consecutive hg convert runs is the authormap. The authormap is a file that allows you to change author names when converting from Subversion to Mercurial, which can be quite useful if you want to add additional information to Mercurial users, such as email addresses. The authormap, like the revmap, is stored in the destination .hg directory. On subsequent hg convert runs, this file is read in and used if no authormap is specified. If there is both an authormap specified on the command line and one in the destination .hg directory, the two are merged, with the one on the command line winning whenever there is a discrepancy.
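The merge rule is easy to sketch. The authormap file uses "source author = destination author" lines; the helper names and the sample authors below are hypothetical.

```python
def parse_authormap(text):
    """Parse 'source author = destination author' lines into a dict."""
    authors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, dst = line.split("=", 1)
        authors[src.strip()] = dst.strip()
    return authors


def merge_authormaps(stored, given):
    """Combine both maps; entries given on the command line win."""
    merged = dict(stored)
    merged.update(given)
    return merged


stored = parse_authormap("alice=Alice <alice@example.com>\nbob=Bob\n")
given = parse_authormap("# override\nbob = Bob B <bob@example.com>\n")
merged = merge_authormaps(stored, given)
```

After the merge, alice keeps her stored mapping while bob picks up the newer one from the command line.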

How the filemap works


fmap = opts.get('filemap')
if fmap:
    srcc = filemap.filemap_source(ui, srcc, fmap)
    destc.setfilemapmode(True)

One last aspect of conversion deserves consideration - the filemap. Implementation of the filemap uses an interesting design. The code for handling the filemap is in a filemap converter object, much like the subversion converter object. This filemap converter wraps the subversion converter and does the mapping in a way that both the subversion converter and the hg converter can be oblivious to its presence.

The filemap converter object handles two major pieces of functionality. First, it takes care of renaming files. The renaming of files is done by a filemapper, which keeps a map of from and to filenames. Whenever filenames are passed to or from the converter object, it does the mapping necessary.
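As a rough idea of what such a mapper does, here is a sketch that renames by the longest matching directory prefix. The rename_path helper and the example paths are assumptions of mine, not the extension's implementation.

```python
def rename_path(path, renames):
    """Apply the most specific rename whose 'from' prefix matches path.

    renames maps an old file or directory name to its new name.
    """
    for old in sorted(renames, key=len, reverse=True):
        # Match the exact name or anything beneath it as a directory.
        if path == old or path.startswith(old + "/"):
            return renames[old] + path[len(old):]
    return path  # no rule applies: keep the original name


renames = {"src": "lib", "src/vendor": "third_party"}
moved = rename_path("src/vendor/x.c", renames)
```

Trying the longest prefix first means the src/vendor rule beats the more general src rule for files under src/vendor.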

The more interesting challenge is determining which files and revisions should actually be included in the conversion. First, the filemap converter checks whether a given revision touches any files that are included by the filemap. If so, the revision needs to be converted. But the revision also needs to have its parent updated to the correct revision. In Subversion, the parent of a given revision is simply the previous revision. Of course, that previous revision might not touch any files in the filemap, and so be discarded during conversion. So the filemap converter also needs to reparent the new revision onto the most recent revision that was actually included.
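For a linear trunk, the reparenting step boils down to remembering the last revision that survived the filter. A minimal sketch, with a hypothetical included predicate standing in for the filemap rules:

```python
def reparent(revisions, included):
    """Drop revisions with no included files and re-link the survivors.

    revisions is an oldest-first list of (rev, [paths touched]).
    Returns (rev, new parent) pairs; the first survivor has parent None.
    """
    kept = []
    last = None
    for rev, paths in revisions:
        if any(included(path) for path in paths):
            kept.append((rev, last))  # parent is the last kept revision
            last = rev
    return kept


revisions = [("r1", ["app/a.py"]), ("r2", ["docs/b.txt"]), ("r3", ["app/c.py"])]
kept = reparent(revisions, lambda path: path.startswith("app/"))
```

In this example r2 only touches docs, so it is discarded and r3 ends up with r1 as its parent.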

Coming Soon …


At its root, this algorithm is quite simple. But understanding what is going on in the simple case is essential to understanding what happens when we make it more complicated with branches, tags, and the options associated with them. Part two will be a detailed look at how branches are converted from Subversion to Mercurial.