Wednesday 11 August 2010

My workflow with git-cl + Rietveld

Git's model of changes (which is shared by Mercurial, Bazaar and Monotone) makes it awkward to revise earlier patches. This can make things difficult when you are sending out multiple, dependent changes for code review.

Suppose I create changes A and B. B depends functionally on A, i.e. tests will not pass for B without A also being applied. There might or might not be a textual dependency (B might or might not modify lines of code modified by A).

Because code review is slow (high latency), I need to be able to send out changes A and B for review and still be able to continue working on further changes. But I also need to be able to revisit A to make changes to it based on review feedback, and then make sure B works with the revised A.

What I do is create separate branches for A and B, where B branches off of A. To revise change A, I "git checkout" its branch and add further commits. Later I can update B by checking it out and rebasing it onto the current tip of A. Uploading A or B to the review system or committing A or B upstream (to SVN) involves squashing their branch's commits into one commit. (This squashing means the branches contain micro-history that reviewers don't see and which is not kept after changes are pushed upstream.)

The review system in question is Rietveld, the code review web app used for Chromium and Native Client development. Rietveld does not have any special support for patch series -- it is only designed to handle one patch at a time, so it does not know about dependencies between changes. The tool for uploading changes from Git to Rietveld and later committing them to SVN is "git-cl" (part of depot_tools).

git-cl is intended to be used with one branch per change-under-review. However, it does not have much support for handling changes which depend on each other.

This workflow has a lot of problems:

  • When using git-cl on its own, I have to manually keep track of the fact that B is to be rebased onto A. When uploading B to Rietveld, I must do "git cl upload A". When updating B, I must first do "git rebase A". When diffing B, I have to do "git diff A". (I have written a tool to do this. It's not very good, but it's better than doing it manually.)
  • Rebasing B often produces conflicts if A has been squash-committed to SVN. That's because if branch A contained multiple patches, Git doesn't know how to skip over patches from A that are in branch B.
  • Rebasing loses history. Undoing a rebase is not easy.
  • In the case where B doesn't depend on A, rebasing branch B so that it doesn't include the contents of branch A is a pain. (Sometimes I will stack B on top of A even when it doesn't depend on A, so that I can test the changes together. An alternative is to create a temporary branch and "git merge" A and B into it, but creating further branches adds to the complexity.)
  • If there is a conflict, I don't find out about it until I check out and update the affected branch.
  • This gets even more painful if I want to maintain changes that are not yet ready for committing or posting for review, and apply them alongside changes that are ready for review.
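
The bookkeeping described in the first bullet can be sketched in a few lines of Python. This is only an illustration, not the tool mentioned in that bullet: the PARENT table and both helper functions are hypothetical, and the script only builds the command lines quoted above rather than running them.

```python
# Hypothetical record of which branch each change is stacked on.
# None means the branch sits directly on upstream.
PARENT = {"A": None, "B": "A", "C": "B"}


def commands_for(branch, parent=PARENT):
    """Return the commands needed to update, diff and upload one branch."""
    base = parent[branch]
    if base is None:
        # Not stacked on another change: plain git-cl usage is enough.
        return ["git checkout " + branch, "git cl upload"]
    return [
        "git checkout " + branch,
        "git rebase " + base,
        "git diff " + base,
        "git cl upload " + base,
    ]


def rebase_order(changed, parent=PARENT):
    """List the branches stacked on `changed`, nearest first.

    Each branch appears after the branch it is based on, so rebasing
    in this order never rebases onto a stale base.
    """
    order = []
    frontier = [changed]
    while frontier:
        base = frontier.pop(0)
        for branch in sorted(parent):
            if parent[branch] == base:
                order.append(branch)
                frontier.append(branch)
    return order


# After revising A, print what has to be done for everything stacked on it.
for branch in rebase_order("A"):
    print("\n".join(commands_for(branch)))
```

Even with a table like this, the rebase itself can still hit the conflicts described in the other bullets; the sketch only removes the need to remember which branch goes with which.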

These are all reasons why I would not recommend this workflow to someone who is not already very familiar with Git.

The social solution to this problem would be for code reviews to happen faster, which would reduce the need to stack up changes. If all code reviews reached a conclusion within 24 hours, that would be an improvement. But I don't think that is going to happen.

The technical solution would be better patch management tools. I am increasingly thinking that Darcs' set-of-patches model would work better for this than Git's DAG-of-commits model. If I could set individual patches to be temporarily applied or unapplied to the working copy, and reorder and group patches, I think it would be easier to revisit changes that I have posted for review.

Friday 6 August 2010

CVS's problems resurface in Git

Although modern version control systems have improved a lot on CVS, I get the feeling that there is a fundamental version control problem that the modern VCSes (Git, Mercurial, Bazaar, and I'll include Subversion too!) haven't solved. The curious thing is that CVS had taken some steps towards addressing it.

In CVS, history is stored per file. If you commit a change that crosses multiple files, CVS updates each file's history separately. This causes a bunch of problems:

  • CVS does not represent changesets or snapshots as first class objects. As a result, many operations involve visiting every file's history.

    Reconstructing a changeset involves searching all files' histories to match up the individual file changes. (This was just about possible, though I hear there are tricky corner cases. Later CVS added a commit ID field that presumably helped with this.)

    Creating a tag at the latest revision involves adding a tag to every file's history. Reconstructing a tag, or a time-based snapshot, involves visiting every file's history again.

  • CVS does not represent file renamings, so the standard history tools like "cvs log" and "cvs annotate" are not able to follow a file's history from before it was renamed.
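
To make the changeset reconstruction described above concrete, here is a minimal sketch in Python. It assumes a simple heuristic: per-file revisions that share an author and log message, and that fall within a short time window of each other, are guessed to belong to one commit. The record layout, the window parameter and the sample data are made up for illustration, and the sketch ignores the tricky corner cases.

```python
from collections import namedtuple

# One entry from a single file's history (hypothetical layout).
FileRevision = namedtuple("FileRevision", "path revision author message timestamp")


def reconstruct_changesets(revisions, window=60):
    """Group per-file revisions into guessed changesets.

    Revisions with the same author and message are put in the same group
    when each one is within `window` seconds of the previous one.
    """
    changesets = []
    open_sets = {}  # (author, message) -> most recent changeset for that key
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.message)
        current = open_sets.get(key)
        if current is not None and rev.timestamp - current[-1].timestamp <= window:
            current.append(rev)
        else:
            current = [rev]
            open_sets[key] = current
            changesets.append(current)
    return changesets


# Every file's history has to be gathered before any changeset can be rebuilt.
history = [
    FileRevision("Makefile", "1.4", "alice", "Add parser", 1000),
    FileRevision("parser.c", "1.1", "alice", "Add parser", 1002),
    FileRevision("README", "1.7", "bob", "Fix typo", 1500),
    FileRevision("parser.c", "1.2", "alice", "Add parser", 9000),
]
for changeset in reconstruct_changesets(history):
    print([(rev.path, rev.revision) for rev in changeset])
```

Note that the last revision reuses an earlier log message but lands in its own group because it falls outside the time window, which is the kind of guesswork a first-class changeset object makes unnecessary.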

In the DAG-based decentralised VCSes (Git, Mercurial, Monotone, Bazaar), history is stored per repository. The fundamental data structure for history is a Directed Acyclic Graph of commit objects. Each commit points to a snapshot of the entire file tree plus zero or more parent commits. This addresses CVS's problems:

  • Extracting changesets is easy because they are the same thing as commit objects.
  • Creating a tag is cheap and easy. Recording any change creates a commit object (a snapshot-with-history), so creating a tag is as simple as pointing to an already-existing commit object.
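
The data structure can be sketched in Python as follows. This is a simplified model, not Git's actual object format: the commit and changeset helpers, the SHA-1 hash over a JSON encoding, and the dictionary-based object store are all assumptions chosen to keep the example short.

```python
import hashlib
import json

objects = {}  # content-addressed store: commit ID -> commit object
tags = {}     # tag name -> commit ID


def commit(tree, parents, message):
    """Store a commit pointing at a whole-tree snapshot and its parents."""
    obj = {"tree": tree, "parents": list(parents), "message": message}
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    commit_id = hashlib.sha1(encoded).hexdigest()
    objects[commit_id] = obj
    return commit_id


def changeset(commit_id):
    """Diff a commit's snapshot against its first parent's snapshot."""
    obj = objects[commit_id]
    old = objects[obj["parents"][0]]["tree"] if obj["parents"] else {}
    new = obj["tree"]
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }


root = commit({"README": "hello"}, [], "Initial commit")
tip = commit({"README": "hello", "main.c": "int main() {}"}, [root], "Add main.c")

# Tagging visits no file histories: it only records a pointer to a commit.
tags["v1.0"] = tip

print(changeset(tip))
```

Because the commit ID covers the snapshot and the parent IDs, a tag that points at one commit pins down the whole tree and its history in a single step.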

However, often it is not practical to put all the code that you're interested in into a single Git repository! (I pick on Git here because, of the DAG-based systems, it is the one I am most familiar with.) While it can be practical to do this with Subversion or CVS, it is less practical with the DAG-based decentralised VCSes:

  • In the DAG-based systems, branching is done at the level of a repository. You cannot branch and merge subdirectories of a repository independently: you cannot create a commit that only partially merges two parent commits.
  • Checking out a Git repository involves downloading not only the entire current revision, but the entire history. So this creates pressure against putting two partially-related projects together in the same repository, especially if one of the projects is huge.
  • Existing projects might already use separate repositories. It is usually not practical to combine those repositories into a single repository, because that would create a repo that is incompatible with the original repos. That would make it difficult to merge upstream changes. Patch sharing would become awkward because the filenames in patches would need fixing.

This all means that when you start projects, you have to decide how to split your code among repositories. Changing these decisions later is not at all straightforward.

The result of this is that CVS's problems have not really been solved: they have just been pushed up a level. The problems that occurred at the level of individual files now occur at the level of repositories:

  • The DAG-based systems don't represent changesets that cross repositories. They don't have a type of object for representing a snapshot across repositories.
  • Creating a tag across repositories would involve visiting every repository to add a tag to it.
  • There is no support for moving files between repositories while tracking the history of the file.

The funny thing is that since CVS hit this problem all the time, the CVS tools were better at dealing with multiple histories than Git.

To compare the two, imagine that instead of putting your project in a single Git repository, you put each one of the project's files in a separate Git repository. This would result in a history representation that is roughly equivalent to CVS's history representation. That is, every file has its own separate history graph.

  • To check in changes to multiple files, you have to "cd" to each file's repository directory, and "git commit" and "git push" the file change.
  • To update to a new upstream version, or to switch branch, you have to "cd" to each file's repository directory again to do "git pull/fetch/rebase/checkout" or whatever.
  • Correlating history across files must be done manually. You could run "git log" or "gitk" on two repositories and match up the timelines or commit messages by hand. I don't know of any tools for doing this.

In contrast, for CVS, "cvs commit" works across multiple files and (if I remember rightly) even across multiple working directories. "cvs update" works across multiple files.

While "cvs log" doesn't work across multiple files, there is a tool called "CVS Monitor" which reconstructs history and changesets across files.

Experience with CVS suggests that Git could be changed to handle the multiple-repository case better. "git commit", "git checkout" etc. could be changed to operate across multiple Git working copies. Maybe "git log" and "gitk" could gain options to interleave histories by timestamp.
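
As a rough illustration of what interleaving histories by timestamp could mean, the Python sketch below merges per-repository logs into one timeline. It is not an existing Git feature: the interleave function, the tuple layout and the sample data are hypothetical, and each repository's log is assumed to be already sorted newest first.

```python
import heapq


def interleave(logs):
    """Merge per-repository logs into one timeline, newest first.

    `logs` maps a repository name to a list of (timestamp, message) pairs,
    each list already sorted newest first.
    """
    streams = [
        [(timestamp, repo, message) for timestamp, message in entries]
        for repo, entries in logs.items()
    ]
    # Compare on the timestamp only; every input stream is already descending.
    return list(heapq.merge(*streams, key=lambda entry: entry[0], reverse=True))


logs = {
    "compiler": [(300, "Fix codegen bug"), (100, "Initial import")],
    "runtime": [(250, "Use new intrinsic"), (200, "Add allocator")],
}
for timestamp, repo, message in interleave(logs):
    print(timestamp, repo, message)
```

This only lines up the timelines for a human to read. It does not record which commits in different repositories actually belong together.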

Of course, that would lead to cross-repo support that is only as good as CVS's cross-file support. We might be able to apply a textual tag name across multiple Git repos with a single command just as a tag name can be applied across files with "cvs tag". But that doesn't give us an immutable tag object that spans repos.

My point is that the fundamental data structure used in the DAG-based systems doesn't solve CVS's problem; it just pushes it up to a coarser level of granularity. Some possible solutions to the problem are DEPS files (as used by Chromium), Git submodules, or Darcs-style set-of-patches repos. These all introduce new data structures. Do any of these solve the original problem? I am undecided -- this question will have to wait for another post. :-)