gitgit-rebasegit-plumbing

How does `git rebase` work under the hood?


I recently started working with git trees and temporary index files to construct commits without modifying my working directory, for the purpose of automating some tasks. The ultimate goal is to effectively rebase some branch (e.g. feature/x_y_z on top of main), but without modifying the working directory to do it. Obviously I still want to detect conflicts, and definitely not clobber changes on main (as one could using git commit-tree). I read through the "Git Internals" chapter of the book, which was very educational regarding trees, blobs, indexes, etc. — but didn't explicitly explain how a rebase works.

(Aside: The motivation for this is 1) it's way faster, and 2) I want to enable developers to kick off tests for some commit/branch, quickly, consuming latest canonical changes, without clobbering their working directory.)

To that end, how does git rebase work under the hood? What plumbing commands does it use? How does it analyze trees to detect conflicts? Links to useful sources, and/or a direct explanation of these things, would be very helpful.


Solution

  • To that end, how does git rebase work under the hood?

    It's complicated. It's particularly complicated because of history: it was originally a small shell script using git format-patch and git am, but that has some flaws, so it was rewritten as a much fancier set of shell scripts. These included a merge based back-end and an interactive back-end, splitting off the old am-based back-end into a third variant. Since then, it's been rewritten yet again, in C, to use what Git calls the sequencer. The interactive code has been fancied up to allow re-performing merges as well. I'll ignore these cases as they're harder to draw and explain.

    What plumbing commands does it use?

    Now that it has been rewritten in C, it does not use any of them.

    In the old days, the interactive back end mostly used git cherry-pick (which technically isn't quite a plumbing command), plus git commit --amend for squash operations, after using git rev-list to collect the hash IDs of the commits to copy with cherry-pick.

    The C variant is being modified now to build in more and more parts (to make things go faster on Windows, mainly) but does still invoke the merge separately at the moment.

    How does it analyze trees to detect conflicts?

    This is the fundamental work of git cherry-pick: it invokes git merge but sets the merge base to be the parent of the commit being copied. The current commit at this time is the commit at the tip of the branch being extended to achieve the rebase.

    That is, we have something along these lines:

        H--I--J   <-- to-copy (HEAD)
       /
    ...--o--o--o   <-- optional-random-other-stuff-cluttering-up-the-diagram
             \
              A--B--C   <-- target
    

    The branch we want to "rebase" is the one identified by the name to-copy; the commit we want the copies to appear after is commit C. So we run:

    git checkout to-copy
    

    to make sure we're starting from the right place, then run:

    git rebase target
    

    Or, if what we have looks like this:

    ...--A--B--C   <-- main
          \
           D--E--F--G   <-- feature1
               \
                H--I--J   <-- to-copy (HEAD)
    

    and we want to copy just H-I-J to land after C, we run:

    git rebase --onto main feature1
    

    so as to exclude commits D-E from the copy list.

    The rebase operation first generates a list of commit hash IDs to copy, in this case, the actual raw hash IDs for commits H through J inclusive.

    Rebase will normally omit, from this list, certain commits:

    For most simple linear chains of commits, this omission step does nothing at all; that's the case with the commits I'm illustrating here.

    Having built the list of commit hash IDs to copy, rebase now checks out the --onto target commit C as a detached HEAD. If there is no --onto argument, the target commit is that specified by the upstream argument after the git rebase command, or is that specified by the upstream of the branch before the HEAD-detaching step. So, for the more-complicated --onto variant, we now have this:

    ...--A--B--C   <-- main, HEAD
          \
           D--E--F--G   <-- feature1
               \
                H--I--J   <-- to-copy
    

    Rebase now cherry-picks each to-be-copied commit, one at a time, in the appropriate and necessary order (H first, then I, then J). Each of these cherry-pick operations is handled as if it were a git merge, but with a peculiarly forced merge base commit. I'll go into detail in a moment, but let's just assume that the cherry-pick of H works and makes a new commit; let's call the new commit H', to indicate that it's a "copy" of H, and draw it in:

                 H'  <-- HEAD
                /
    ...--A--B--C   <-- main
          \
           D--E--F--G   <-- feature1
               \
                H--I--J   <-- to-copy
    

    We now repeat this with I and J to get:

                 H'-I'-J'  <-- HEAD
                /
    ...--A--B--C   <-- main
          \
           D--E--F--G   <-- feature1
               \
                H--I--J   <-- to-copy
    

    Once the last of the to-be-copied commits has been copied, git rebase yanks the name of the original branch off the original commit J and pastes it on to point to the final copied commit, in this case J', and then re-attaches HEAD:

                 H'-I'-J'  <-- to-copy (HEAD)
                /
    ...--A--B--C   <-- main
          \
           D--E--F--G   <-- feature1
               \
                H--I--J   ???
    

    With no name to find commit J, it disappears from our view, and it now seems as though Git has somehow changed three commits. (It hasn't—the originals are still in the repository. You can find them through reflogs, or through ORIG_HEAD, though the C rewrite of rebase introduced a bug where ORIG_HEAD is sometimes wrong.)


    1The actual symmetric difference used is HEAD...target, more or less. (Since it's symmetric, you can swap the left and right sides, as long as you remember which side is which.) So these are the commits that have their patch-IDs computed. Git can compute patch IDs even for merge commits, though rebase normally omits merges. I've never dived in far enough to see if it does compute them when you tell it to copy merges, and what happens in this case if a merge commit does have a duplicate, but that's an interesting question.


    Git's merge engine

    To understand cherry-picking, let's start with a much more normal, everyday operation: a true merge. When we do a true merge, we're combining work. Let's suppose we have the following commit graph:

              I--J   <-- br1 (HEAD)
             /
    ...--G--H
             \
              K--L   <-- br2
    

    That is, we have two branches, br1 and br2, each of which has two commits of its own that follow some shared sequence of commits ending at commit H.

    As you know by now from reading up on Git internals, each commit has a full snapshot of every file. Being a snapshot, not a set of changes, there's no obvious way to view the snapshot as changes, until you realize that what Git does is play, over and over again, a game of Spot the Difference. We put two commits up somewhere, as two snapshots, then programmatically eyeball each one and figure out what's changed. That's what git diff does.

    Now, if we run a diff from commit H to commit J, that will tell us what changed between those two snapshots. By itself, that's not particularly useful, but suppose we save this information somewhere. Let's run:

    git diff --find-renames <hash-of-H> <hash-of-J>   # what we changed
    

    to find out what we changed on br1 since commit H. We'll save all this somewhere, perhaps in a temporary file or (if we have enough memory) memory.

    Now let's repeat this operation but with the hash of commit L instead:

    git diff --find-renames <hash-of-H> <hash-of-L>   # what they changed
    

    This tells us what happened on br2.

    If we add the changes together, being careful to take only one copy of any given change that's on both branches, and apply the sum of the two sets of changes to snapshot H, we'll get the correct merge result.2

    This, then, is exactly what merge does. It just runs two diffs—using --find-renames to find any tree-wide file rename operations, so that it knows that file old/path/to/file in the merge base is "the same file" as new/name/of/it in the left and/or right side tip commit—and then combines the change-sets from the two diffs, applying them to each file.3

    If the combining goes well, and the merge isn't inhibited with --no-commit,4 Git will go on to make merge commit M on its own. Instead of the normal single parent, which in this case would be commit J, the merge commit has two parents. The first one is the normal one, and the second one is the other branch-tip commit, commit L:

              I--J
             /    \
    ...--G--H      M   <-- br1 (HEAD)
             \    /
              K--L   <-- br2
    

    and the merge is complete.

    If there are conflicts, Git leaves a mess in both its index (the expanded slots remain expanded) and your work-tree: the built in equivalent of git merge-file has scribbled over your work-tree files, to put in its best guess at the correct merge, plus merge conflict markers and sections of two—or with merge.conflictStyle set to diff3, all three—of the input files where there were merge conflicts.

    Note that using -X ours or -X theirs tells Git to resolve conflicted sections by blindly picking our or their side. This only affects these low-level conflicts: add/add, modify/delete, and other high-level or tree-level conflicts still cause the merge to stop and get help.

    (For cherry-pick, these options are all currently handled through the git-merge-recursive back-end, with no way to choose any other merge back-end. For git merge, the -s argument, e.g., git merge -s abcd, makes Git try to run git-merge-abcd. Of course there is no git-merge-abcd back end in the git --exec-path directory, so this would just fail, but the point here is that regular merge lets you select a strategy. Recursive merge is just the default. Cherry-picking does not allow strategy selection.)


    2For some definition of "correct", of course. Git's is entirely line-based: the diffs are line-by-line and the merge works line-by-line.

    3Well, that's the high level overview anyway. Internally, it does the rename-finding as one pass, then the high-level or tree-level file name things as needed—this also handles file creations and deletions, and detects add/add, modify/delete, rename/rename, and other such conflicts—and then goes on to use a separate second pass that has git merge-file built into it to merge each individual group of three files: merge-base, ours, and theirs. The merge process takes place in Git's index, which is temporarily expanded to hold up to three copies of each file, with slot numbers to tell which one is the merge base version (slot 1), which one is the --ours version (slot 2), and which one is the --theirs version (slot 3).

    4Note that --squash turns on --no-commit and there's currently no way to turn it off again, so that --squash always requires a manual git commit at the end.


    How cherry-pick uses Git's merge engine

    To achieve a cherry-pick, Git simply runs its merge engine with a forced parent.

    Suppose we have this commit graph:

    ...--o--P--C--o--...
      ...
        ...--G--H   <-- cur-branch (HEAD)
    

    We are on the current branch cur-branch, with commit H as its tip commit, so that commit H is the current commit. We now run:

    git cherry-pick <hash-of-C>
    

    What Git does is find C's parent P and use that as a fake merge base for a standard merge operation, but make sure that at the end of the merging process, we make a normal, non-merge commit:

    ...--o--P--C--o--...
      ...
        ...--G--H--C'  <-- cur-branch (HEAD)
    

    Commit C' ends up being a "copy" of commit C. To see why, let's look at what diffs turn up.

    git diff --find-renames <hash-of-P> <hash-of-H>   # what we changed
    git diff --find-renames <hash-of-P> <hash-of-C>   # what they changed
    

    Now, finding "what they changed" seems pretty natural. If they added a line after line 42 of some file, that's what we want to do here. So this diff makes perfect sense. But finding "what we changed" seems a bit odd at first. It turns out, though, that it's exactly what we need.

    If they only changed one line of one file, we want to know: did we touch the same line of that file? If not, the changes combine nicely: we take all our changes, to convert all the files in P to match all the files in H, and that gives us back commit H; then we add the one change they made, to the one file, with any line number adjustments required as well, and that adds what they changed in the one file. So that's exactly right.

    If we did both touch the same line of that file, we get a merge conflict, right on that one line. That's exactly right too.

    As we ponder all possible changes—including things like file renamings—we will see that, indeed, making a diff from P to H is just the right thing to do. So that's why we do it. It also means we get to use the existing merge code.

    Of course, the existing merge code operates on/in the index, and uses our working tree as scratch storage. That's why rebasing is relatively slow and painful. The only real way to improve on this is to do more of the merge work directly in memory. Someone is doing this now: search for "merge-ort" from Elijah Newren in the Git mailing list.