I have a shallow cloned git repository that is over 1 GB. I use sparse checkout for the files/dirs needed.
How can I reduce the repository clone to just the sparse checkout files/dirs?
Initially I was able to limit the clone to just the sparse checkout by disabling checkout when cloning, then setting up sparse checkout before doing the initial checkout. This kept the repository to only about 200 MB, which is much more manageable. However, updating remote branch info at some later point causes the rest of the files and dirs to be pulled into the clone, sending its size back over 1 GB, and I don't know how to get back to just the sparse checkout files and dirs.
In short what I want is a shallow AND sparse repository clone. Not just sparse checkout of a shallow repo clone. The full repo is a waste of space and performance for certain tasks suffers.
Hope someone can share a solution. Thanks.
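For concreteness, the workflow described above can be sketched as follows (everything here is illustrative: a throwaway local repository stands in for the real remote, and `needed`/`other` are placeholder directories):

```shell
set -eu

# Throwaway local repository standing in for the real (large) remote.
git init -q origin-repo
mkdir -p origin-repo/needed origin-repo/other
echo keep > origin-repo/needed/file.txt
echo skip > origin-repo/other/file.txt
git -C origin-repo add .
git -C origin-repo -c user.email=you@example.com -c user.name=you commit -qm init

# Shallow clone with checkout disabled, then configure sparse checkout
# *before* the first checkout, so only the needed directory is populated.
git clone -q --depth 1 --no-checkout "file://$PWD/origin-repo" work
cd work
git sparse-checkout init --cone
git sparse-checkout set needed
git checkout -q "$(git symbolic-ref --short HEAD)"

ls    # only "needed" is checked out
```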
Shallow plus sparse amounts to a "partial" or "narrow" clone.
A partial clone (or "narrow clone") is in theory possible, and was first implemented in Dec 2017 with Git 2.16, as seen here.
But:
That is further optimized in Git 2.20 (Q4 2018), since in a partial clone that will lazily be hydrated from the originating repository, we generally want to avoid "does this object exist (locally)?" on objects that we deliberately omitted when we created the (partial/sparse) clone.
The cache-tree codepath (which is used to write a tree object out of the index) however insisted that the object exists, even for paths that are outside of the partial checkout area.
The code has been updated to avoid such a check.
See commit 2f215ff (09 Oct 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a08b1d6, 19 Oct 2018)
`cache-tree`: skip some blob checks in partial clone
In a partial clone, whenever a sparse checkout occurs, the existence of all blobs in the index is verified, whether they are included or excluded by the `.git/info/sparse-checkout` specification.
This significantly degrades performance because a lazy fetch occurs whenever the existence of a missing blob is checked.
With Git 2.24 (Q4 2019), the `cache-tree` code has been taught to be less aggressive in attempting to see if a tree object it computed already exists in the repository.
See commit f981ec1 (03 Sep 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit ae203ba, 07 Oct 2019)
`cache-tree`: do not lazy-fetch tentative tree
The `cache-tree` data structure is used to speed up the comparison between the HEAD and the index; when the index is updated by a cherry-pick (for example), a tree object that would represent the paths in the index in a directory is constructed in-core, to see if such a tree object already exists in the object store.

When the lazy-fetch mechanism was introduced, we converted this "does the tree exist?" check into an "if it does not, and if we lazily cloned, see if the remote has it" call by mistake.

Since the whole point of this check is to repair the cache-tree by recording an already existing tree object opportunistically, we shouldn't even try to fetch one from the remote.

Pass the `OBJECT_INFO_SKIP_FETCH_OBJECT` flag to make sure we only check for existence in the local object store, without triggering the lazy fetch mechanism.
With Git 2.25 (Q1 2020), the "`git fetch`" codepath had a big "do not lazily fetch missing objects when I ask if something exists" switch.
This has been corrected by marking the "does this thing exist?" calls with "if not please do not lazily fetch it" flag.
See commit 603960b, commit e362fad (13 Nov 2019), and commit 6462d5e (05 Nov 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit fce9e83, 01 Dec 2019)
`clone`: remove `fetch_if_missing=0`
Signed-off-by: Jonathan Tan
Commit 6462d5eb9a ("fetch: remove `fetch_if_missing=0`", 2019-11-08) strove to remove the need for `fetch_if_missing=0` from the fetching mechanism, so it is plausible to attempt removing `fetch_if_missing=0` from clone as well.

But doing so reveals a bug - when the server does not send an object directly pointed to by a ref, this should be an error, not a trigger for a lazy fetch. (This case in the fetching mechanism was covered by a test using "git clone", not "git fetch", which is why the aforementioned commit didn't uncover the bug.)

The bug can be fixed by suppressing lazy-fetching during the connectivity check. Fix this bug, and remove `fetch_if_missing` from clone.
And:
`promisor-remote`: remove `fetch_if_missing=0`
Signed-off-by: Jonathan Tan
Commit 6462d5eb9a ("fetch: remove `fetch_if_missing=0`", 2019-11-08) strove to remove the need for `fetch_if_missing=0` from the fetching mechanism, so it is plausible to attempt removing `fetch_if_missing=0` from the lazy-fetching mechanism in `promisor-remote` as well.

But doing so reveals a bug - when the server does not send an object pointed to by a tag object, an infinite loop occurs: Git attempts to fetch the missing object, which causes a dereferencing of all refs (for negotiation), which causes a lazy fetch of that missing object, and so on.

This bug is because of unnecessary use of the fetch negotiator during lazy fetching - it is not used after initialization, but it is still initialized (which causes the dereferencing of all refs).

Thus, when the negotiator is not used during fetching, refrain from initializing it. Then, remove `fetch_if_missing` from `promisor-remote`.
See more in "Bring your monorepo down to size with sparse-checkout" from Derrick Stolee:

Pairing sparse-checkout with the partial clone feature accelerates these workflows even more. This combination speeds up the data transfer process since you don't need every reachable Git object, and instead can download only those you need to populate your cone of the working directory:
$ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (23/23), done.
$ cd sparse-checkout-example/
$ git sparse-checkout init --cone
Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
$ git sparse-checkout set client/android
Receiving objects: 100% (26/26), 985.91 KiB | 5.76 MiB/s, done.
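The partial-clone approach above can also be combined with a shallow clone, which gives essentially the "shallow AND sparse" clone asked about. A minimal self-contained sketch (a throwaway local repository stands in for the real remote, and `needed`/`other` are placeholder directories; a real server must have `uploadpack.allowFilter` enabled for `--filter` to be honored, as popular hosts already do):

```shell
set -eu

# Throwaway "server" repository; uploadpack.allowFilter lets it serve
# partial (filtered) clones.
git init -q big-repo
git -C big-repo config uploadpack.allowfilter true
mkdir -p big-repo/needed big-repo/other
echo keep > big-repo/needed/a.txt
echo skip > big-repo/other/b.txt
git -C big-repo add .
git -C big-repo -c user.email=you@example.com -c user.name=you commit -qm init

# Shallow (--depth=1) AND partial (--filter=blob:none) clone, no checkout yet.
git clone -q --depth=1 --filter=blob:none --no-checkout "file://$PWD/big-repo" small
cd small
git sparse-checkout init --cone
git sparse-checkout set needed          # lazily fetches only the needed blobs
git checkout -q "$(git symbolic-ref --short HEAD)"
```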
Before Git 2.25.1 (Feb. 2020), `has_object_file()` said "no" given an object registered to the system via `pretend_object_file()`, making it inconsistent with `read_object_file()`, causing lazy fetch to attempt fetching an empty tree from promisor remotes.
I tried to reproduce this with

empty_tree=$(git mktree </dev/null)
git init --bare x
git clone --filter=blob:none file://$(pwd)/x y
cd y
echo hi >README
git add README
git commit -m 'nonempty tree'
GIT_TRACE=1 git diff-tree "$empty_tree" HEAD

and indeed, it looks like Git serves the empty tree even from repositories that don't contain it.
See commit 9c8a294 (02 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit e26bd14, 22 Jan 2020)
`sha1-file`: remove `OBJECT_INFO_SKIP_CACHED`
Signed-off-by: Jonathan Tan
In a partial clone, if a user provides the hash of the empty tree ("`git mktree </dev/null`" - for SHA-1, this is 4b825d...) to a command which requires that that object be parsed, for example:

git diff-tree 4b825d <a non-empty tree>

then Git will lazily fetch the empty tree, unnecessarily, because parsing of that object invokes `repo_has_object_file()`, which does not special-case the empty tree.

Instead, teach `repo_has_object_file()` to consult `find_cached_object()` (which handles the empty tree), thus bringing it in line with the rest of the object-store-accessing functions.
A cost is that `repo_has_object_file()` will now need to `oideq` upon each invocation, but that is trivial compared to the filesystem lookup or the pack index search required anyway. (And if `find_cached_object()` needs to do more because of previous invocations to `pretend_object_file()`, all the more reason to be consistent in whether we present cached objects.)

As a historical note, the function now known as `repo_read_object_file()` was taught the empty tree in 346245a1bb ("hard-code the empty tree object", 2008-02-13, Git v1.5.5-rc0 -- merge), and the function now known as `oid_object_info()` was taught the empty tree in c4d9986f5f ("`sha1_object_info`: examine `cached_object` store too", 2011-02-07, Git v1.7.4.1). `repo_has_object_file()` was never updated, perhaps due to oversight.
The flag `OBJECT_INFO_SKIP_CACHED`, introduced later in dfdd4afcf9 ("`sha1_file`: teach `sha1_object_info_extended` more flags", 2017-06-26, Git v2.14.0-rc0) and used in e83e71c5e1 ("`sha1_file`: refactor `has_sha1_file_with_flags`", 2017-06-26, Git v2.14.0-rc0), was introduced to preserve this difference in empty-tree handling, but now it can be removed.
Git 2.25.1 will also warn programmers about `pretend_object_file()`, which allows the code to tentatively use in-core objects.
See commit 60440d7 (04 Jan 2020) by Jonathan Nieder (artagnon).
(Merged by Junio C Hamano -- gitster -- in commit b486d2e, 12 Feb 2020)
`sha1-file`: document how to use `pretend_object_file`
Inspired-by: Junio C Hamano
Signed-off-by: Jonathan Nieder
Like in-memory alternates, `pretend_object_file` contains a trap for the unwary: careless callers can use it to create references to an object that does not exist in the on-disk object store.

Add a comment documenting how to use the function without risking such problems.

The only current caller is blame, which uses `pretend_object_file` to create an in-memory commit representing the working tree state. Noticed during a discussion of how to safely use this function in operations like "git merge" which, unlike blame, are not read-only.
/*
* Add an object file to the in-memory object store, without writing it
* to disk.
*
* Callers are responsible for calling write_object_file to record the
* object in persistent storage before writing any other new objects
* that reference it.
*/
int pretend_object_file(void *, unsigned long, enum object_type,
struct object_id *oid);
Git 2.25.1 (Feb. 2020) also includes future-proofing to make sure a test does not depend on a current implementation detail.
See commit b54128b (13 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 3f7553a, 12 Feb 2020)
`t5616`: make robust to delta base change

Signed-off-by: Jonathan Tan
Commit 6462d5eb9a ("fetch: remove `fetch_if_missing=0`", 2019-11-08) contains a test that relies on having to lazily fetch the delta base of a blob, but assumes that the tree being fetched (as part of the test) is sent as a non-delta object.

This assumption may not hold in the future; for example, a change in the length of the object hash might result in the tree being sent as a delta instead.

Make the test more robust by relying on having to lazily fetch the delta base of the tree instead, and by making no assumptions on whether the blobs are sent as delta or non-delta.
Git 2.25.2 (March 2020) fixes a bug revealed by a recent change to make the protocol v2 the default.
See commit 3e96c66, commit d0badf8 (21 Feb 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 444cff6, 02 Mar 2020)
`partial-clone`: avoid fetching when looking for objects

Signed-off-by: Derrick Stolee
While testing partial clone, I noticed some odd behavior. I was testing a way of running '`git init`', followed by manually configuring the remote for partial clone, and then running '`git fetch`'.

Astonishingly, I saw the '`git fetch`' process start asking the server for multiple rounds of pack-file downloads! When tweaking the situation a little more, I discovered that I could cause the remote to hang up with an error.

Add two tests that demonstrate these two issues.

In the first test, we find that when fetching with blob filters from a repository that previously did not have any tags, the '`git fetch --tags origin`' command fails because the server sends "multiple filter-specs cannot be combined". This only happens when using protocol v2.

In the second test, we see that a '`git fetch origin`' request with several ref updates results in multiple pack-file downloads. This must be due to Git trying to fault-in the objects pointed to by the refs. What makes this matter particularly nasty is that this goes through the `do_oid_object_info_extended()` method, so there are no "haves" in the negotiation.

This leads the remote to send every reachable commit and tree from each new ref, providing a quadratic amount of data transfer! This test is fixed if we revert 6462d5eb9a (fetch: remove `fetch_if_missing=0`, 2019-11-05, Git v2.25.0-rc0), but that revert causes other test failures. The real fix will need more care.
Fix:
When using partial clone, `find_non_local_tags()` in `builtin/fetch.c` checks each remote tag to see if its object also exists locally. There is no expectation that the object exist locally, but this function nevertheless triggers a lazy fetch if the object does not exist. This can be extremely expensive when asking for a commit, as we are completely removed from the context of the non-existent object and thus supply no "haves" in the request.

6462d5eb9a ("fetch: remove `fetch_if_missing=0`", 2019-11-05, Git v2.25.0-rc0) removed a global variable that prevented these fetches in favor of a bitflag. However, some object existence checks were not updated to use this flag.

Update `find_non_local_tags()` to use `OBJECT_INFO_SKIP_FETCH_OBJECT` in addition to `OBJECT_INFO_QUICK`. The `_QUICK` option only prevents repreparing the pack-file structures. We need to be extremely careful about supplying `_SKIP_FETCH_OBJECT` when we expect an object to not exist due to updated refs.

This resolves a broken test in `t5616-partial-clone.sh`.
The logic to auto-follow tags by "`git clone --single-branch`" was not careful to avoid lazy-fetching unnecessary tags; this has been corrected with Git 2.27 (Q2 2020).
See commit 167a575 (01 Apr 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 3ea2b46, 22 Apr 2020)
`clone`: use "quick" lookup while following tags

Signed-off-by: Jeff King
When cloning with `--single-branch`, we implement `git fetch`'s usual tag-following behavior, grabbing any tag objects that point to objects we have locally.

When we're a partial clone, though, our `has_object_file()` check will actually lazy-fetch each tag. That not only defeats the purpose of `--single-branch`, but it does it incredibly slowly, potentially kicking off a new fetch for each tag. This is even worse for a shallow clone, which implies `--single-branch`, because even tags which are supersets of each other will be fetched individually.

We can fix this by passing `OBJECT_INFO_SKIP_FETCH_OBJECT` to the call, which is what `git fetch` does in this case. Likewise, let's include `OBJECT_INFO_QUICK`, as that's what `git fetch` does. The rationale is discussed in 5827a03545 (fetch: use "quick" `has_sha1_file` for tag following, 2016-10-13, Git v2.10.2), but here the tradeoff would apply even more so, because clone is very unlikely to be racing with another process repacking our newly-created repository.

This may provide a very small speedup even in the non-partial case, as we'd avoid calling `reprepare_packed_git()` for each tag (though in practice, we'd only have a single packfile, so that reprepare should be quite cheap).
Before Git 2.27 (Q2 2020), serving a "`git fetch`" client over "`git://`" and "`ssh://`" protocols using the on-wire protocol version 2 was buggy on the server end when the client needs to make a follow-up request to e.g. auto-follow tags.
See commit 08450ef (08 May 2020) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit a012588, 13 May 2020)
`upload-pack`: clear `filter_options` for each v2 fetch command

Helped-by: Derrick Stolee
Helped-by: Jeff King
Helped-by: Taylor Blau
Signed-off-by: Christian Couder
Because of the request/response model of protocol v2, the `upload_pack_v2()` function is sometimes called twice in the same process, while '`struct list_objects_filter_options filter_options`' was declared as static at the beginning of '`upload-pack.c`'.

This made the check in `list_objects_filter_die_if_populated()`, which is called by `process_args()`, fail the second time `upload_pack_v2()` is called, as `filter_options` had already been populated the first time.

To fix that, `filter_options` is not static any more. It's now owned directly by `upload_pack()`. It's now also part of '`struct upload_pack_data`', so that it's owned indirectly by `upload_pack_v2()`.

In the long term, the goal is to also have `upload_pack()` use '`struct upload_pack_data`', so adding `filter_options` to this struct makes more sense than having it owned directly by `upload_pack_v2()`.

This fixes the first of the 2 bugs documented by d0badf8797 ("`partial-clone`: demonstrate bugs in partial fetch", 2020-02-21, Git v2.26.0-rc0 -- merge listed in batch #8).
With Git 2.29 (Q4 2020), the `pretend-object` mechanism checks if the given object already exists in the object store before deciding to keep the data in-core, but the check would have triggered lazy fetching of such an object from a promisor remote.
See commit a64d2aa (21 Jul 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 5b137e8, 04 Aug 2020)
`sha1-file`: make `pretend_object_file()` not prefetch

Signed-off-by: Jonathan Tan
When `pretend_object_file()` is invoked with an object that does not exist (as is the typical case), there is no need to fetch anything from the promisor remote, because the caller already knows what the object is supposed to contain. Therefore, suppress the fetch. (The `OBJECT_INFO_QUICK` flag is added for the same reason.)

This was noticed at $DAYJOB when "`blame`" was run on a file that had uncommitted modifications.
With Git 2.37 (Q3 2022), "`git mktree --missing`" lazily fetched objects that are missing from the local object store, which was totally unnecessary for the purpose of creating the tree object(s) from its input.
See commit 817b0f6 (21 Jun 2022) by Richard Oliver (RichardBray).
(Merged by Junio C Hamano -- gitster -- in commit 6fccbda, 13 Jul 2022)
`mktree`: do not check type of remote objects

Signed-off-by: Richard Oliver
With 31c8221 ("`mktree`: validate entry type in input", 2009-05-14, Git v1.6.4-rc0 -- merge), we called the `sha1_object_info()` API to obtain the type information, but allowed the call to silently fail when the object was missing locally, so that we can sanity-check the types opportunistically when the object did exist.

The implementation is understandable because back then there was no lazy/on-demand downloading of individual objects from the promisor remotes that causes a long delay and materializes the object, hence defeating the point of using "`--missing`". The design is hurting us now.

We could bypass the opportunistic type/mode consistency check altogether when "`--missing`" is given, but instead, use the `oid_object_info_extended()` API and tell it that we are only interested in objects that locally exist and are immediately available, by passing the `OBJECT_INFO_SKIP_FETCH_OBJECT` bit to it. That way, we will still retain the cheap and opportunistic sanity check for local objects.
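As a sketch of what `--missing` permits (the blob id below is valid but deliberately absent from the local object store, since `hash-object` without `-w` only computes an id without storing anything):

```shell
set -eu
git init -q mktree-demo
cd mktree-demo

# Compute a valid blob id WITHOUT storing the blob (no -w), so the
# object is genuinely missing from the local object store.
missing=$(echo absent | git hash-object --stdin)

# Without --missing this would fail; with it, mktree writes a tree
# object that references the locally-missing blob (and, since Git 2.37,
# no longer tries to lazy-fetch it from a promisor remote).
printf '100644 blob %s\tabsent.txt\n' "$missing" | git mktree --missing
```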