git, repository, clone, sparse-checkout

Shallow AND Sparse Git Repository Clone


I have a shallow cloned git repository that is over 1 GB. I use sparse checkout for the files/dirs needed.

How can I reduce the repository clone to just the sparse checkout files/dirs?

Initially I was able to limit the cloned repository to only the sparse checkout by disabling checkout when cloning, then setting up sparse checkout before doing the initial checkout. This limited the repository to about 200 MB, which is much more manageable. However, updating the remote branch info at some point in the future causes the rest of the files and dirs to be included in the repository clone, sending the clone size back over 1 GB, and I don't know how to shrink it back down to just the sparse checkout files and dirs.

In short, what I want is a shallow AND sparse repository clone, not just a sparse checkout of a shallow repo clone. The full repo is a waste of space, and performance for certain tasks suffers.

Hope someone can share a solution. Thanks.


Solution

  • Shallow and sparse means "partial" or "narrow".

    A partial clone (or "narrow clone") is in theory possible, and was implemented first in Dec 2017 with Git 2.16, as seen here.
    But:

    That is further optimized in Git 2.20 (Q4 2018), since in a partial clone that will lazily be hydrated from the originating repository, we generally want to avoid "does this object exist (locally)?" on objects that we deliberately omitted when we created the (partial/sparse) clone.
    The cache-tree codepath (which is used to write a tree object out of the index) however insisted that the object exists, even for paths that are outside of the partial checkout area.
    The code has been updated to avoid such a check.
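
    Putting this together for the question asked: with a reasonably recent Git (roughly 2.25+, and 2.28+ for `git init -b`), shallow and partial clone can simply be combined, with sparse-checkout layered on top. A minimal, self-contained sketch against a throwaway local repository (all repo names, paths, and the `keep`/`drop` directories are illustrative):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Throwaway "remote" with two top-level directories.
git init -q -b main origin
git -C origin config uploadpack.allowFilter 1   # let clients use --filter
mkdir -p origin/keep origin/drop
echo keep >origin/keep/file.txt
echo drop >origin/drop/file.txt
git -C origin add .
git -C origin -c user.name=demo -c user.email=demo@example.com commit -qm init

# Shallow AND sparse: depth-1 history, no blobs up front, nothing checked out.
git clone -q --depth=1 --filter=blob:none --no-checkout "file://$tmp/origin" narrow
cd narrow
git sparse-checkout init --cone
git sparse-checkout set keep    # only the blobs under keep/ are fetched lazily
git checkout -q main

ls                              # only keep/ is present in the working tree
```

    Later fetches stay lean too: the remote is recorded as a promisor, so missing blobs are only downloaded when something actually needs them.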

    See commit 2f215ff (09 Oct 2018) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit a08b1d6, 19 Oct 2018)

    cache-tree: skip some blob checks in partial clone

    In a partial clone, whenever a sparse checkout occurs, the existence of all blobs in the index is verified, whether they are included or excluded by the .git/info/sparse-checkout specification.
    This significantly degrades performance because a lazy fetch occurs whenever the existence of a missing blob is checked.


    With Git 2.24 (Q4 2019), the cache-tree code has been taught to be less aggressive in attempting to see if a tree object it computed already exists in the repository.

    See commit f981ec1 (03 Sep 2019) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit ae203ba, 07 Oct 2019)

    cache-tree: do not lazy-fetch tentative tree

    The cache-tree datastructure is used to speed up the comparison between the HEAD and the index, and when the index is updated by a cherry-pick (for example), a tree object that would represent the paths in the index in a directory is constructed in-core, to see if such a tree object exists already in the object store.

    When the lazy-fetch mechanism was introduced, we converted this "does the tree exist?" check into an "if it does not, and if we lazily cloned, see if the remote has it" call by mistake.
    Since the whole point of this check is to repair the cache-tree by recording an already existing tree object opportunistically, we shouldn't even try to fetch one from the remote.

    Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only check for existence in the local object store without triggering the lazy fetch mechanism.


    With Git 2.25 (Q1 2020), "git fetch" codepath had a big "do not lazily fetch missing objects when I ask if something exists" switch.

    This has been corrected by marking the "does this thing exist?" calls with "if not please do not lazily fetch it" flag.

    See commit 603960b, commit e362fad (13 Nov 2019), and commit 6462d5e (05 Nov 2019) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit fce9e83, 01 Dec 2019)

    clone: remove fetch_if_missing=0

    Signed-off-by: Jonathan Tan

    Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from clone as well. But doing so reveals a bug - when the server does not send an object directly pointed to by a ref, this should be an error, not a trigger for a lazy fetch. (This case in the fetching mechanism was covered by a test using "git clone", not "git fetch", which is why the aforementioned commit didn't uncover the bug.)

    The bug can be fixed by suppressing lazy-fetching during the connectivity check. Fix this bug, and remove fetch_if_missing from clone.

    And:

    promisor-remote: remove fetch_if_missing=0

    Signed-off-by: Jonathan Tan

    Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from the lazy-fetching mechanism in promisor-remote as well.

    But doing so reveals a bug - when the server does not send an object pointed to by a tag object, an infinite loop occurs: Git attempts to fetch the missing object, which causes a dereferencing of all refs (for negotiation), which causes a lazy fetch of that missing object, and so on.
    This bug is because of unnecessary use of the fetch negotiator during lazy fetching - it is not used after initialization, but it is still initialized (which causes the dereferencing of all refs).

    Thus, when the negotiator is not used during fetching, refrain from initializing it. Then, remove fetch_if_missing from promisor-remote.


    See more with "Bring your monorepo down to size with sparse-checkout" from Derrick Stolee

    Pairing sparse-checkout with the partial clone feature accelerates these workflows even more.
    This combination speeds up the data transfer process since you don't need every reachable Git object, and instead can download only those you need to populate your cone of the working directory.

    $ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
    Cloning into 'sparse-checkout-example'...
    Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
    Resolving deltas: 100% (23/23), done.
     
    $ cd sparse-checkout-example/
     
    $ git sparse-checkout init --cone
    Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
     
    $ git sparse-checkout set client/android
    Receiving objects: 100% (26/26), 985.91 KiB | 5.76 MiB/s, done.
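
    As a side note, you can verify that a clone really was made with a filter: clone records the filter spec and marks the remote as a promisor. A self-contained illustration against a throwaway local repository (names illustrative):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Throwaway "remote" that permits partial-clone filters.
git init -q -b main origin
git -C origin config uploadpack.allowFilter 1
echo hi >origin/README
git -C origin add README
git -C origin -c user.name=demo -c user.email=demo@example.com commit -qm init

git clone -q --filter=blob:none "file://$tmp/origin" narrow
cd narrow
git config remote.origin.promisor             # true
git config remote.origin.partialCloneFilter   # blob:none
```

    If either of those config keys is missing, the clone was not partial and every reachable blob was downloaded up front.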
    

    Before Git 2.25.1 (Feb. 2020), has_object_file() said "no" given an object registered to the system via pretend_object_file(), making it inconsistent with read_object_file(), causing lazy fetch to attempt fetching an empty tree from promisor remotes.

    See discussion.

    I tried to reproduce this with

    empty_tree=$(git mktree </dev/null)
    git init --bare x
    git clone --filter=blob:none file://$(pwd)/x y
    cd y
    echo hi >README
    git add README
    git commit -m 'nonempty tree'
    GIT_TRACE=1 git diff-tree "$empty_tree" HEAD
    

    and indeed, it looks like Git serves the empty tree even from repositories that don't contain it.
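
    That is easy to see directly, because the empty tree is hard-coded into Git and resolvable even in a repository where it was never written to the object store. A small sketch (assumes the default SHA-1 object format; the id differs under SHA-256):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q --object-format=sha1 repo
cd repo

# The hard-coded SHA-1 empty tree resolves even in a brand-new repo.
git cat-file -t 4b825dc642cb6eb9a060e54bf8d69288fbee4904   # tree

# mktree on empty input produces that same well-known id.
git mktree </dev/null
```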

    See commit 9c8a294 (02 Jan 2020) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit e26bd14, 22 Jan 2020)

    sha1-file: remove OBJECT_INFO_SKIP_CACHED

    Signed-off-by: Jonathan Tan

    In a partial clone, if a user provides the hash of the empty tree ("git mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which requires that that object be parsed, for example:

    git diff-tree 4b825d <a non-empty tree>
    

    then Git will lazily fetch the empty tree, unnecessarily, because parsing of that object invokes repo_has_object_file(), which does not special-case the empty tree.

    Instead, teach repo_has_object_file() to consult find_cached_object() (which handles the empty tree), thus bringing it in line with the rest of the object-store-accessing functions.
    A cost is that repo_has_object_file() will now need to oideq upon each invocation, but that is trivial compared to the filesystem lookup or the pack index search required anyway. (And if find_cached_object() needs to do more because of previous invocations to pretend_object_file(), all the more reason to be consistent in whether we present cached objects.)

    As a historical note, the function now known as repo_read_object_file() was taught the empty tree in 346245a1bb ("hard-code the empty tree object", 2008-02-13, Git v1.5.5-rc0 -- merge), and the function now known as oid_object_info() was taught the empty tree in c4d9986f5f ("sha1_object_info: examine cached_object store too", 2011-02-07, Git v1.7.4.1).

    repo_has_object_file() was never updated, perhaps due to oversight.
    The flag OBJECT_INFO_SKIP_CACHED, introduced later in dfdd4afcf9 ("sha1_file: teach sha1_object_info_extended more flags", 2017-06-26, Git v2.14.0-rc0) and used in e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26, Git v2.14.0-rc0), was introduced to preserve this difference in empty-tree handling, but now it can be removed.


    Git 2.25.1 also warns programmers about pretend_object_file(), which allows the code to tentatively use in-core objects.

    See commit 60440d7 (04 Jan 2020) by Jonathan Nieder (artagnon).
    (Merged by Junio C Hamano -- gitster -- in commit b486d2e, 12 Feb 2020)

    sha1-file: document how to use pretend_object_file

    Inspired-by: Junio C Hamano
    Signed-off-by: Jonathan Nieder

    Like in-memory alternates, pretend_object_file contains a trap for the unwary: careless callers can use it to create references to an object that does not exist in the on-disk object store.

    Add a comment documenting how to use the function without risking such problems.

    The only current caller is blame, which uses pretend_object_file to create an in-memory commit representing the working tree state. Noticed during a discussion of how to safely use this function in operations like "git merge" which, unlike blame, are not read-only.

    So the comment is now:

    /*
     * Add an object file to the in-memory object store, without writing it
     * to disk.
     *
     * Callers are responsible for calling write_object_file to record the
     * object in persistent storage before writing any other new objects
     * that reference it.
     */
    int pretend_object_file(void *, unsigned long, enum object_type,
                struct object_id *oid);
    

    Git 2.25.1 (Feb. 2020) also includes some future-proofing, to make sure a test does not depend on the current implementation details.

    See commit b54128b (13 Jan 2020) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit 3f7553a, 12 Feb 2020)

    t5616: make robust to delta base change

    Signed-off-by: Jonathan Tan

    Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) contains a test that relies on having to lazily fetch the delta base of a blob, but assumes that the tree being fetched (as part of the test) is sent as a non-delta object.
    This assumption may not hold in the future; for example, a change in the length of the object hash might result in the tree being sent as a delta instead.

    Make the test more robust by relying on having to lazily fetch the delta base of the tree instead, and by making no assumptions on whether the blobs are sent as delta or non-delta.


    Git 2.25.2 (March 2020) fixes a bug revealed by a recent change to make the protocol v2 the default.

    See commit 3e96c66, commit d0badf8 (21 Feb 2020) by Derrick Stolee (derrickstolee).
    (Merged by Junio C Hamano -- gitster -- in commit 444cff6, 02 Mar 2020)

    partial-clone: avoid fetching when looking for objects

    Signed-off-by: Derrick Stolee

    While testing partial clone, I noticed some odd behavior. I was testing a way of running 'git init', followed by manually configuring the remote for partial clone, and then running 'git fetch'.
    Astonishingly, I saw the 'git fetch' process start asking the server for multiple rounds of pack-file downloads! When tweaking the situation a little more, I discovered that I could cause the remote to hang up with an error.

    Add two tests that demonstrate these two issues.

    In the first test, we find that when fetching with blob filters from a repository that previously did not have any tags, the 'git fetch --tags origin' command fails because the server sends "multiple filter-specs cannot be combined". This only happens when using protocol v2.

    In the second test, we see that a 'git fetch origin' request with several ref updates results in multiple pack-file downloads.
    This must be due to Git trying to fault-in the objects pointed by the refs. What makes this matter particularly nasty is that this goes through the do_oid_object_info_extended() method, so there are no "haves" in the negotiation.
    This leads the remote to send every reachable commit and tree from each new ref, providing a quadratic amount of data transfer! This test is fixed if we revert 6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0), but that revert causes other test failures.
    The real fix will need more care.

    Fix:

    When using partial clone, find_non_local_tags() in builtin/fetch.c checks each remote tag to see if its object also exists locally. There is no expectation that the object exist locally, but this function nevertheless triggers a lazy fetch if the object does not exist. This can be extremely expensive when asking for a commit, as we are completely removed from the context of the non-existent object and thus supply no "haves" in the request.

    6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0) removed a global variable that prevented these fetches in favor of a bitflag. However, some object existence checks were not updated to use this flag.

    Update find_non_local_tags() to use OBJECT_INFO_SKIP_FETCH_OBJECT in addition to OBJECT_INFO_QUICK.
    The _QUICK option only prevents repreparing the pack-file structures. We need to be extremely careful about supplying _SKIP_FETCH_OBJECT when we expect an object to not exist due to updated refs.

    This resolves a broken test in t5616-partial-clone.sh.


    The logic to auto-follow tags by "git clone --single-branch" was not careful to avoid lazy-fetching unnecessary tags, which has been corrected with Git 2.27 (Q2 2020).

    See commit 167a575 (01 Apr 2020) by Jeff King (peff).
    (Merged by Junio C Hamano -- gitster -- in commit 3ea2b46, 22 Apr 2020)

    clone: use "quick" lookup while following tags

    Signed-off-by: Jeff King

    When cloning with --single-branch, we implement git fetch's usual tag-following behavior, grabbing any tag objects that point to objects we have locally.

    When we're a partial clone, though, our has_object_file() check will actually lazy-fetch each tag.

    That not only defeats the purpose of --single-branch, but it does it incredibly slowly, potentially kicking off a new fetch for each tag.
    This is even worse for a shallow clone, which implies --single-branch, because even tags which are supersets of each other will be fetched individually.

    We can fix this by passing OBJECT_INFO_SKIP_FETCH_OBJECT to the call, which is what git fetch does in this case.

    Likewise, let's include OBJECT_INFO_QUICK, as that's what git fetch does.
    The rationale is discussed in 5827a03545 (fetch: use "quick" has_sha1_file for tag following, 2016-10-13, Git v2.10.2), but here the tradeoff would apply even more so because clone is very unlikely to be racing with another process repacking our newly-created repository.

    This may provide a very small speedup even in the non-partial case, as we'd avoid calling reprepare_packed_git() for each tag (though in practice, we'd only have a single packfile, so that reprepare should be quite cheap).
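
    The tag-following behavior described above is easy to observe: in a single-branch partial clone, tags pointing into the cloned branch are still copied over, and since this fix the existence checks no longer trigger one lazy fetch per tag. A self-contained sketch (names illustrative):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q -b main origin
git -C origin config uploadpack.allowFilter 1
echo v1 >origin/file
git -C origin add file
git -C origin -c user.name=demo -c user.email=demo@example.com commit -qm v1
git -C origin tag v1.0

# Partial, single-branch clone: tags on the cloned branch are auto-followed.
git clone -q --single-branch --filter=blob:none "file://$tmp/origin" narrow
git -C narrow tag    # v1.0
```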


    Before Git 2.27 (Q2 2020), serving a "git fetch" client over "git://" and "ssh://" protocols using the on-wire protocol version 2 was buggy on the server end when the client needs to make a follow-up request to e.g. auto-follow tags.

    See commit 08450ef (08 May 2020) by Christian Couder (chriscool).
    (Merged by Junio C Hamano -- gitster -- in commit a012588, 13 May 2020)

    upload-pack: clear filter_options for each v2 fetch command

    Helped-by: Derrick Stolee
    Helped-by: Jeff King
    Helped-by: Taylor Blau
    Signed-off-by: Christian Couder

    Because of the request/response model of protocol v2, the upload_pack_v2() function is sometimes called twice in the same process, while 'struct list_objects_filter_options filter_options' was declared as static at the beginning of 'upload-pack.c'.

    This made the check in list_objects_filter_die_if_populated(), which is called by process_args(), fail the second time upload_pack_v2() is called, as filter_options had already been populated the first time.

    To fix that, filter_options is not static any more. It's now owned directly by upload_pack(). It's now also part of 'struct upload_pack_data', so that it's owned indirectly by upload_pack_v2().

    In the long term, the goal is to also have upload_pack() use 'struct upload_pack_data', so adding filter_options to this struct makes more sense than to have it owned directly by upload_pack_v2().

    This fixes the first of the 2 bugs documented by d0badf8797 ("partial-clone: demonstrate bugs in partial fetch", 2020-02-21, Git v2.26.0-rc0 -- merge listed in batch #8).


    With Git 2.29 (Q4 2020), the pretend-object mechanism checks if the given object already exists in the object store before deciding to keep the data in-core, but the check would have triggered lazy fetching of such an object from a promisor remote.

    See commit a64d2aa (21 Jul 2020) by Jonathan Tan (jhowtan).
    (Merged by Junio C Hamano -- gitster -- in commit 5b137e8, 04 Aug 2020)

    sha1-file: make pretend_object_file() not prefetch

    Signed-off-by: Jonathan Tan

    When pretend_object_file() is invoked with an object that does not exist (as is the typical case), there is no need to fetch anything from the promisor remote, because the caller already knows what the object is supposed to contain. Therefore, suppress the fetch. (The OBJECT_INFO_QUICK flag is added for the same reason.)

    This was noticed at $DAYJOB when "blame" was run on a file that had uncommitted modifications.


    With Git 2.37 (Q3 2022), "git mktree --missing" lazily fetched objects that are missing from the local object store, which was totally unnecessary for the purpose of creating the tree object(s) from its input.

    See commit 817b0f6 (21 Jun 2022) by Richard Oliver (RichardBray).
    (Merged by Junio C Hamano -- gitster -- in commit 6fccbda, 13 Jul 2022)

    mktree: do not check type of remote objects

    Signed-off-by: Richard Oliver

    With 31c8221 ("mktree: validate entry type in input", 2009-05-14, Git v1.6.4-rc0 -- merge), we called the sha1_object_info() API to obtain the type information, but allowed the call to silently fail when the object was missing locally, so that we can sanity-check the types opportunistically when the object did exist.

    The implementation is understandable because back then there was no lazy/on-demand downloading of individual objects from the promisor remotes that causes a long delay and materializes the object, hence defeating the point of using "--missing".
    The design is hurting us now.

    We could bypass the opportunistic type/mode consistency check altogether when "--missing" is given, but instead, use the oid_object_info_extended() API and tell it that we are only interested in objects that locally exist and are immediately available by passing OBJECT_INFO_SKIP_FETCH_OBJECT bit to it.
    That way, we will still retain the cheap and opportunistic sanity check for local objects.
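
    In practice this means "git mktree --missing" now stays local: it will happily build a tree that references an object id the repository does not have, without trying to fetch it from a promisor remote. A self-contained sketch (the object id below is a made-up placeholder, not a real object):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo

# A well-formed but nonexistent blob id (placeholder).
ghost=0123456789abcdef0123456789abcdef01234567

# Without --missing this would fail the existence check; with it, the tree
# object is written anyway, and no lazy fetch is attempted (Git 2.37+).
printf '100644 blob %s\tghost.txt\n' "$ghost" | git mktree --missing
```

    The command prints the id of the newly created tree object, even though the blob it references does not exist locally.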