Wednesday 26 March 2014

How to do history-sensitive merges in Git

Merging in Git is usually not history-sensitive. By this I mean: if you're merging branches A and B together, Git looks at the content at the tips of branches A and B, and the content of the common ancestor commit(s) of A and B, but it doesn't look at any commits inbetween. Git just does a 3-way merge. This can make merging more painful than it needs to be.

History-sensitive merging is a merge algorithm that does look at the commit history. It can automatically resolve conflicts in cases where non-history-sensitive merging won't. It can also present conflicts more readably: It can reduce the size of a conflict and show which upstream change conflicted with your local change.

This is a short "how to" for doing a history-sensitive merge in Git.

My case study is: Merging LLVM 3.4 into PNaCl's branch of LLVM, which I did recently.

The problem

PNaCl has a branch of LLVM with various patches applied. (As an aside, we should really upstream those patches, to reduce the difficulty of merging from upstream.) Before the merge, this branch was based on LLVM 3.3 (commit $FROM). We wanted to upgrade it to being based on LLVM 3.4 (commit $TO).

Simply doing

git merge $TO

produced about 160 conflicts, as counted by "git grep '<<<<<<<' | wc -l".

A large number of those conflicts occurred because pnacl-llvm contained various changes cherry-picked from upstream LLVM. Often those changes touched code that was later refactored upstream. That will produce a conflict if we do a 3-way merge.

For example, pnacl-llvm cherry-picked various changes that add test files with "CHECK:" lines. Upstream later refactored these to "CHECK-LABEL:". If we do "git merge $TO", Git will see that the two branches added a file with the same name but different contents. Bam: that's a conflict. Git's 3-way merge doesn't look at the history, so it doesn't notice that both branches contain identical commits that add the file with a "CHECK:" line, and just one of the two branches later modifies that file.

A solution

We can do better just by telling Git to merge each upstream LLVM change one by one:

set -eu  # Stop on error
commits="$(git rev-list --reverse $FROM..$TO)"
for commit in $commits; do
  echo Merging $commit
  git merge --no-edit $commit
done

When this reaches a cherry-picked change, it should merge it with no conflict, and later merge the CHECK-LABEL refactoring with no conflict.

When this does hit a conflict -- i.e. when an upstream change conflicts with a local change -- it shows which upstream change conflicted. This provides more context for resolving the conflict.

This script is restartable. After you've resolved a conflict and git commit'd the resolution, it's fine to re-run the script. It will skip over the already-merged upstream changes.

This approach generates many small merges, usually with small conflicts, which can be resolved incrementally and committed to a branch. That is often easier than resolving a large set of conflicts resulting from one big merge, where it's also difficult to save or reuse a work-in-progress conflict resolution.

The one-by-one merge approach has some disadvantages. If a change is committed and reverted upstream, and conflicts with your local branch, you'll be prompted to resolve the conflict twice, back and forth. If you notice the change is committed, reverted and then re-applied upstream, it can be quicker to manually git merge the re-applied version.

Going faster

The script above goes quite slowly because it merges each change in turn. An optimisation is: explicitly merge only those upstream changes that modify files that our local branch also modifies (since all other upstream changes will be trivially merged).

set -eu  # Stop on error
files="$(git diff --name-only $FROM origin/master)"
# Relatively slow
git rev-list --reverse $FROM..$TO $files > tmp_commits_to_merge
echo $TO >> tmp_commits_to_merge  # Don't miss this commit

for commit in $(cat tmp_commits_to_merge); do
  echo Merging $commit
  git merge --no-edit $commit
done

Cleaning up

This approach produces a lot of Git merge commits. If you don't want to keep this history, you can squash the merges down into a single merge commit:

tree=$(git log --pretty=format:%T --no-walk HEAD)
commit=$(git commit-tree -m 'Commit message for merge...' \
    -p $(git merge-base HEAD origin/master) \
    -p $(git merge-base HEAD upstream/master) \
    $tree)
git push . $commit:my-merge

I believe the term "history-sensitive merge" comes from Darcs, a version control system which implements history-sensitive merging.