
How can I combine Git repositories into a linear history?


I have two git repositories R1 and R2, which contain commits from two periods of a product's development: 1995-1997 and 1999-2013. (I created them by converting existing RCS and CVS repositories into Git.)

R1:
A---B---C---D

R2:
K---L---M---N

How can I combine the two repositories into a single one that contains an accurate view of the project's linear history?

A---B---C---D---K---L---M---N

Note that between R1 and R2 files have been added, deleted, and renamed.

I tried creating an empty repository and then merging their contents onto it.

git remote add R1 /vol/R1.git
git fetch R1

git remote add R2 /vol/R2.git
git fetch R2

git merge --strategy=recursive --strategy-option=theirs R1
git merge --strategy=recursive --strategy-option=theirs R2

However, the result still contains files that were present in revision D but not in revision K. I could craft a synthetic commit between the merges to remove the extra files, but this seems inelegant. Furthermore, the result of this approach contains merges that never actually occurred.


Solution

  • Using git filter-branch

    This uses a trick straight from the git-filter-branch man page.

    First, create a new repository with the two original ones as remotes, just as you did before. I am assuming that both use the branch name "master".

    git init repo
    cd repo
    git remote add R1 /vol/R1.git
    git fetch R1
    git remote add R2 /vol/R2.git
    git fetch R2
    

    Next, point "master" (the current branch) to the tip of R2's "master".

    git reset --hard R2/master
    

    Now we can graft the history of R1's "master" to the beginning.

    git filter-branch --parent-filter 'sed "s_^\$_-p R1/master_"' HEAD
    

    In other words, we are giving K a fake parent, D, so that the new history looks like:

    A---B---C---D---K---L---M---N
    

    The only change to K through N is that K's parent pointer changes, and thus the SHA-1 identifiers of K through N all change. The commit message, author, timestamp, etc., stay the same.
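
To see the effect on something small, the steps above can be replayed on two throwaway repositories. This is only a sketch: the directory layout, the file names old.txt and new.txt, the commit subjects, and the make_repo helper are all made up for the demonstration, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the pause that newer Git versions add before filter-branch runs.

```shell
#!/bin/sh
# Toy demonstration: graft a two-commit "R2" onto a two-commit "R1".
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

WORK=$(mktemp -d)

# make_repo <dir> <file> <subject>...: hypothetical helper that creates a
# repository with one commit per subject, all touching the same file.
make_repo() {
    dir=$1; file=$2; shift 2
    git init -q "$dir"
    git -C "$dir" symbolic-ref HEAD refs/heads/master
    for subject in "$@"; do
        echo "$subject" > "$dir/$file"
        git -C "$dir" add "$file"
        git -C "$dir" commit -q -m "$subject"
    done
}

make_repo "$WORK/R1" old.txt A B   # the old history, which only has old.txt
make_repo "$WORK/R2" new.txt K L   # the new history, which only has new.txt

# Same steps as in the answer, pointed at the toy repositories.
git init -q "$WORK/repo"
cd "$WORK/repo"
git symbolic-ref HEAD refs/heads/master
git remote add R1 "$WORK/R1"; git fetch -q R1
git remote add R2 "$WORK/R2"; git fetch -q R2
git reset -q --hard R2/master
git filter-branch --parent-filter 'sed "s_^\$_-p R1/master_"' HEAD >/dev/null

# Oldest commit first; expected: A B K L
HISTORY=$(git log --reverse --format=%s | paste -s -d ' ' -)
echo "history: $HISTORY"
```

After the rewrite the working tree contains new.txt but not old.txt, because the trees of K and L are carried over unchanged; only their parent links and commit IDs differ.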

    Merging more than two repositories together with filter-branch

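    If you have more than two repositories to do, say R1 (oldest) through R5 (newest), just repeat the git reset and git filter-branch commands in chronological order, each time grafting the next repository onto the rewritten result of the previous pass. (Grafting onto the original R2/master, R3/master, ... would drop the older history, because those remote-tracking branches are not rewritten.)

    # Start from the tip of the oldest repository.
    PARENT=$(git rev-parse R1/master)
    for CHILD_REPO in R2 R3 R4 R5; do
        git reset --hard $CHILD_REPO/master
        # -f overwrites the refs/original backup left by the previous pass.
        git filter-branch -f --parent-filter 'sed "s_^\$_-p '$PARENT'_"' HEAD
        # Chain the next repository onto the rewritten tip, not the original one.
        PARENT=$(git rev-parse HEAD)
    done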
    

    Using grafts

    As an alternative to using the --parent-filter option to filter-branch, you may instead use the grafts mechanism.

    Consider the original situation of appending R2/master as a child of (that is, newer than) R1/master. As before, start by pointing the current branch (master) to the tip of R2/master.

    git reset --hard R2/master
    

    Now, instead of running the filter-branch command, create a "graft" (fake parent) in .git/info/grafts that links the "root" (oldest) commit of R2/master (K) to the tip (newest) commit in R1/master (D). (If there are multiple roots of R2/master, the following will only link one of them.)

    ROOT_OF_R2=$(git rev-list R2/master | tail -n 1)
    TIP_OF_R1=$(git rev-parse R1/master)
    echo $ROOT_OF_R2 $TIP_OF_R1 >> .git/info/grafts
    

    At this point, you can look at your history (say, through gitk) to see if it looks right. If so, you can make the changes permanent via:

    git filter-branch
    

    Finally, you can clean everything up by removing the graft file.

    rm .git/info/grafts
    

    Using grafts is likely more work than using --parent-filter, but it does have the advantage of being able to graft together more than two histories with a single filter-branch. (You could do the same with --parent-filter, but the script would become very ugly very fast.) It also has the advantage of allowing you to see your changes before they become permanent; if it looks bad, just delete the graft file to abort.
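
The "does it look right" check can also be scripted rather than eyeballed in gitk. A minimal sketch, assuming a hypothetical helper named is_linear (it is not a Git command): it succeeds when the history reachable from a revision has exactly one root commit and no merge commits. Since git rev-list honors the graft file, it should work for the preview as well as after the rewrite.

```shell
# is_linear [<rev>]: succeed if <rev> (default HEAD) has exactly one root
# commit and no merge commits in its history. Hypothetical helper name.
is_linear() {
    rev=${1:-HEAD}
    roots=$(git rev-list --count --max-parents=0 "$rev")
    merges=$(git rev-list --count --min-parents=2 "$rev")
    [ "$roots" -eq 1 ] && [ "$merges" -eq 0 ]
}
```

For example, `is_linear master && echo "history is linear"` after creating the graft, and again after git filter-branch.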

    Merging more than two repositories together with grafts

    To use the graft method with R1 (oldest) through R5 (newest), just add multiple lines to the graft file. (The order in which you run the echo commands does not matter.) Afterwards, run git filter-branch and remove the graft file as before.

    git reset --hard R5/master
    
    PARENT_REPO=R1
    for CHILD_REPO in R2 R3 R4 R5; do
        ROOT_OF_CHILD=$(git rev-list $CHILD_REPO/master | tail -n 1)
        TIP_OF_PARENT=$(git rev-parse $PARENT_REPO/master)
        echo "$ROOT_OF_CHILD" "$TIP_OF_PARENT" >> .git/info/grafts
        PARENT_REPO=$CHILD_REPO
    done
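
Newer Git versions print a deprecation hint when they find .git/info/grafts and point to git replace --graft instead, which records the same fake parent as a replacement ref. A sketch of the equivalent step, assuming a hypothetical helper named graft_root; like the tail -n 1 approach above, it links only one root if the child history has several.

```shell
# graft_root <child> <parent>: record <parent> as the parent of <child>'s
# root commit using a replacement ref instead of the graft file.
# graft_root is a hypothetical helper name, not a Git command.
graft_root() {
    root=$(git rev-list --max-parents=0 "$1" | tail -n 1)
    git replace --graft "$root" "$2"
}
```

For the two-repository case this would be `graft_root R2/master R1/master` after the git reset --hard R2/master step. As with the graft file, git filter-branch makes the change permanent, and git replace -d followed by the root commit's ID removes the replacement afterwards.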
    

    What about git rebase?

    Several others have suggested using git rebase R1/master instead of the git filter-branch command above. This will take the diff between the empty tree and K and then try to apply it to D (and then replay L through N on top), resulting in:

    A---B---C---D---K'---L'---M'---N'
    

    This will most likely cause a merge conflict, and may even result in spurious files being created in K' if a file was deleted between D and K. The only case in which this will work is if the trees of D and K are identical.

    (Another slight difference is that git rebase alters the committer information for K' through N', whereas git filter-branch does not.)
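
The spurious-file effect is easy to reproduce on a toy history. A sketch, assuming made-up branch names old and new that stand in for R1/master and R2/master, built as two unrelated histories inside one throwaway repository: old.txt exists in D but was dropped before K.

```shell
#!/bin/sh
# Toy demonstration: rebasing an unrelated history keeps files that the
# newer history had already dropped.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .

# "old" plays the role of R1/master: a single commit D containing old.txt.
git symbolic-ref HEAD refs/heads/old
echo 1 > old.txt
git add old.txt
git commit -q -m D

# "new" plays the role of R2/master: an unrelated root commit K that
# contains only new.txt.
git checkout -q --orphan new
git rm -q -rf .
echo 2 > new.txt
git add new.txt
git commit -q -m K

# Replay K on top of D, as git rebase R1/master would.
git rebase -q old

# K' now lists old.txt as well as new.txt, although K never contained it.
git ls-tree --name-only HEAD
```

With the filter-branch or graft approach, the rewritten K keeps its original tree, so old.txt would be absent.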