git, git-bundle

Splitting git bundle file


Is there any way to split a git bundle file? Say, into repo.bundle1 and repo.bundle2, each containing half of the repo. The portable bundle is too large to transfer.

How else could I approach this, assuming the maximum size allowed for transfer cannot be altered?


Solution

  • Bundles can be incremental.

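    They can't have dangling commits: every commit in a bundle has to be reachable from a ref (a branch or tag) that the bundle records, and an expression such as master~7 can't be packaged as a tip. So there is a bit of a game you have to play if you want to incrementally bundle an existing branch.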

    They have to be applied "in order" so that as you apply a bundle, its root commits' parents are available to latch onto. (There may be a way to get around this with shallow repos, but if you're trying to ultimately reconstruct the entire repo then you won't want to worry about that.)

    And of course if any single commit is too large (e.g. because it adds a very large file), that will be a problem, since a commit can't be split across bundles.
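    To see up front whether one oversized file is going to get in the way, something like the following can list the largest blobs in the repo. This is only a rough sketch: largest_blobs is a made-up helper name, and the sizes shown are uncompressed object sizes, so they only hint at how much room a commit will take inside a (compressed) bundle.

```shell
# largest_blobs [N]: list the N largest blobs reachable from any ref,
# one per line as "blob <sha> <uncompressed size in bytes> <path>".
# (Hypothetical helper; N defaults to 10.)
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "${1:-10}"
}
```

    Run largest_blobs 5 from inside the repo; if the top entry alone is near the transfer limit, partitioning by commits won't help for the commit that introduces it.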

    Say you have

           x -- x -- x <--(branch1)
          /
    A -- B -- C -- D -- E -- F -- G -- H -- I -- J <--(master)
                    \      /
                     o -- o <--(branch2)
    

    And say you want to break this into bundles of no more than 3 commits. So let's start at the root. We're going to progressively move the master branch, so let's keep track of its current position.

    git checkout master
    git tag real_master
    

    Now we look up the SHA ID for C (or find some other name that refers to C, such as in this case master~7) and then

    git reset --hard master~7
    

    Note that I'm using hard resets; that's probably not necessary, but I'm making the assumption that you can do this from a repo with a clean work tree, and in that case doing hard resets keeps everything in nice, simple states (as I see it, anyway).

    We're ready to create our first bundle

    git bundle create 0.bundle master
    

    This bundle includes B, which is the root for branch1, so we can bundle up branch1 now.

    git bundle create 1.bundle master..branch1
    

    This is equivalent to

    git bundle create 1.bundle ^master branch1
    

    Either way, we're saying to assume that the receiving repo already has the commits reachable from master, so only the x commits will be placed in this bundle.
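    Since git bundle create takes the same revision arguments as git rev-list, you can ask rev-list how many commits a range selects before you bundle it. A small sketch (commits_in_range is just an illustrative name, not a git command):

```shell
# commits_in_range REV...: count the commits selected by the given
# revision arguments, i.e. the commits a bundle created with the same
# arguments would carry. (Hypothetical helper.)
commits_in_range() {
    git rev-list --count "$@"
}

# Example calls, to be run inside the repo:
#   commits_in_range master..branch1
#   commits_in_range ^master branch1    # same selection, so same count
```

    In the example repo both calls should report the three x commits while master sits at C.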

    It might seem like D, E, F is the next logical step, but F depends on the o commits in branch2. So really the next logical thing is to bundle branch2 along with D. Since we still have master at C we can say

    git bundle create 2.bundle master..branch2
    

    Now we need to move master to G so that we can bundle E, F, and G. Make sure we're on master and

    git reset --hard real_master~3
    git bundle create 3.bundle ^branch2 ^master~3 master
    

    Here I'm noting that both older mainline history and branch2 history are reachable from master (by way of the merge at F), but since they're both already bundled I exclude both of them.

    Finally,

    git reset --hard real_master
    git tag -d real_master
    git bundle create 4.bundle master~3..master
    

    In practice you probably would use more than 3 commits per bundle. If you have a side-branch that's too big on its own, you can break it up using the same technique we used to segment master in this example.

    Now you can transfer these independently, and fetch (or pull) from them in order to reconstruct the repo on the other end.
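    For the receiving end, here is a minimal sketch of applying the bundles in order. It assumes the target is a bare repo (so fetching straight into refs/heads/* doesn't collide with a checked-out branch) and that only branch refs matter; apply_bundles is a made-up helper name.

```shell
# apply_bundles DEST BUNDLE...: fetch each bundle, in the order given,
# into the bare repository DEST (created if it does not exist yet).
# (Hypothetical helper.)
apply_bundles() {
    dest=$1
    shift
    [ -d "$dest" ] || git init -q --bare "$dest" || return 1
    for bundle in "$@"; do
        # "git bundle verify" fails when the bundle is invalid or when its
        # prerequisite commits are not yet present in DEST.
        git --git-dir="$dest" bundle verify "$bundle" >/dev/null 2>&1 || {
            echo "cannot apply $bundle (prerequisites missing?)" >&2
            return 1
        }
        git --git-dir="$dest" fetch -q "$bundle" 'refs/heads/*:refs/heads/*' || return 1
    done
}

# Example call:
#   apply_bundles repo.git 0.bundle 1.bundle 2.bundle 3.bundle 4.bundle
```

    Afterwards a normal git clone repo.git gives you a working copy. If a bundle is passed out of order, the verify step stops the loop before anything is fetched from it.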

    UPDATES

    Two additional notes:


    First, as compared to ElpieKay's suggestion to use dd and cat, the above approach has pros and cons.

    It only relies on git itself (though the utilities needed for the dd/cat approach typically ship with git).

    The individual bundle files are each useful by themselves, whereas if you segment the file with dd you have to reconstruct all the parts to be sure you have a usable bundle. This also means you could save the bundles and combine them with additional bundles you create later (as more changes happen); but that would only matter if you need to create another new remote repo from scratch at that point.

    Actually just shipping incremental changes back and forth, where both sides already have a common baseline of commits, is the basic use case for bundles. So you might decide to use the dd/cat approach to initially create the remote repo, then use incremental bundles for subsequent sharing of updates.

    The biggest advantage of the dd/cat approach is that it's very rote / scriptable (i.e. simple, assuming the tools are on hand), whereas you have to think about how to partition the commits for the above approach. The dd approach can also split a single, obnoxiously large commit if there turns out to be one.
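    For comparison, a rough sketch of what the dd/cat route could look like. These are not necessarily the exact commands ElpieKay proposed; split_bundle and join_bundle are made-up helper names, and the file names follow the question's repo.bundle1 / repo.bundle2.

```shell
# split_bundle FILE MIB: write the first MIB mebibytes of FILE to FILE1
# and everything after that to FILE2. (Hypothetical helper.)
split_bundle() {
    dd if="$1" of="${1}1" bs=1048576 count="$2" 2>/dev/null &&
    dd if="$1" of="${1}2" bs=1048576 skip="$2" 2>/dev/null
}

# join_bundle FILE OUT: concatenate FILE1 and FILE2, in that order,
# back into OUT on the receiving side. (Hypothetical helper.)
join_bundle() {
    cat "${1}1" "${1}2" > "$2"
}

# Example calls:
#   split_bundle repo.bundle 100          # sender
#   join_bundle repo.bundle rebuilt.bundle   # receiver
```

    On the receiving side it's worth running git bundle verify on the rebuilt file (or comparing checksums with the sender) before relying on it, since neither half is usable on its own.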


    I also forgot to mention initially that you can list multiple branches to be included in a bundle. So for example, if your threshold were more like 8 commits per bundle, you could

    1) Move master to E

    2) Bundle master and branch1 as 0.bundle

    3) Move master back to J

    4) Bundle master excluding master~5 as 1.bundle

    and be done.
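    Spelled out as commands, those four steps might look like the sketch below. It assumes the example repo from above with a clean work tree, and reuses the real_master tag trick; the function name make_two_bundles is only for illustration.

```shell
# make_two_bundles: run inside the example repository with a clean work
# tree. Produces 0.bundle and 1.bundle in the current directory.
# (Hypothetical wrapper around the four steps.)
make_two_bundles() {
    git checkout -q master &&
    git tag real_master &&                        # remember where J is
    git reset -q --hard master~5 &&               # 1) move master to E
    git bundle create 0.bundle master branch1 &&  # 2) A..E plus the x commits
    git reset -q --hard real_master &&            # 3) move master back to J
    git tag -d real_master >/dev/null &&
    git bundle create 1.bundle master~5..master   # 4) F..J plus the o commits
}
```

    Note that 1.bundle carries the o commits (they're reachable from master through the merge at F) but only records the master ref. If you want branch2 to show up as a branch on the other end, you'd presumably list it in step 4 as well.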