
Can git fetch-pack be instructed to fetch a single tree object?


In short, is there a space-efficient way to specify the exact objects I want from a Git server that supports the smart protocol but not the filter-spec?

More context: because GitHub's pack protocol lacks filter-spec support, I've been trying to construct a way to fetch a multi-gigabyte repository in which a single commit also comprises multiple gigabytes. My idea was to use fetch-pack requests (handled by upload-pack on the server) that specify a want for a single commit object only. From there I would read that commit, find the tree it references, fetch that tree object in another request, and then manually specify which blob and tree objects I want. What I've discovered, though, is that the pack protocol seems to operate from the perspective of delivering as much data as it can for any commit or tree that you "want".

What this means for what I'm doing is that any time I specify a commit or a tree hash, I get not just the commit or tree object but also every object reachable from it. This also happens when I use the deepen setting to limit how many commits I want: 0 yields nothing and 1 yields the aforementioned result. I have verified that specifying a want for just a blob does result in a pack file containing only that blob, so that part works as expected.


Solution

  • What you're requesting isn't possible in the Git protocol unless the filter functionality is enabled.

    The Git protocol is and always has been designed to efficiently exchange a set of commits. The way that Git implements the protocol on the server side for fetches is that it marks the client's have commits as uninteresting and then walks the revisions from what's requested down to the uninteresting points, including all the necessary objects reachable between those points. This approach necessarily requires that the points you're walking be commits.

    It is possible to send a request for a tree object, but the server side won't do what you expect. You'll end up with that tree and everything reachable from it (all the blobs and other trees) in the pack, which is going to be significantly more data than you want. Again, this makes perfect sense if you think about how the Git protocol works: the user has requested all of the objects reachable from this point.

    You can specify that you have certain tree objects so as to exclude them, but of course that requires that you know what they are, which in this case you don't. Even so, you'd still receive the blobs that exist within that level of the hierarchy.

    The filter functionality just adjusts the objects that are included in the pack, so you can specify that only the one tree object is to be included by excluding everything below its depth. These arguments are passed to git rev-list --objects so that the pack generation will exclude the things you're not interested in. Otherwise, the default is to include every reachable object within the range you've requested.