I'm processing a lot of Git repositories with software I wrote. The basic process is like this:
git clone --no-checkout --filter=blob:none <url>
git fetch --prune --prune-tags --force
to update the data with the newest changes. This works, but since I'm processing thousands of repos, the clones still take up a lot of disk space (more than a terabyte) even though I use --filter=blob:none. And I don't need that data: once I have processed a Git object (commit, tree, or tag), I never need it again.
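For context, my per-repo processing is roughly the following sketch, where process_object stands in for my actual software:

git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
while read -r oid type; do
    # only commits, trees, and tags matter to me; blobs are excluded by the filter anyway
    case "$type" in
        commit|tree|tag) process_object "$oid" "$type" ;;
    esac
done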
Is there a way to delete all, or most, of the objects in the repository while keeping the ability to fetch changes, and without having to fetch the deleted objects again?
I've looked at shallow clones, promisor files, and replace references, but it's all very complicated, and every command/option seems to do something just a little bit different from what I need.
Git doesn't really provide a command to delete arbitrary objects from the object store. At most, you can use git gc to remove dangling (unreachable) objects, but that's not your case: the objects you want to drop are still reachable from your refs.
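For completeness, an immediate prune of unreachable objects looks like this, but it won't help here, because it never touches reachable history:

git reflog expire --expire=now --all
git gc --prune=now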
In your scenario, once you've processed all the data, you could keep track of the current tip commit of each ref, delete the whole repo, and then clone it again with the option --depth=1. Since --depth implies --single-branch, the option --no-single-branch is necessary to fetch the histories near the tips of all branches:
git clone --no-checkout --filter=blob:none --no-single-branch --depth=1 <url>
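A sketch of the whole cycle could look like this (the directory name and the tips file are just placeholders):

git -C repo for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags > processed-tips.txt
rm -rf repo
git clone --no-checkout --filter=blob:none --no-single-branch --depth=1 <url> repo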
However, after the second clone, make sure that the head commit of each branch corresponds to the last tip you've processed. If the tips don't match, you could force-fetch until the previously processed head is included in the object store, incrementing the depth by a certain amount at each iteration:
git fetch --force --depth=<n> origin <branch>
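For instance, assuming old_tip holds the hash you recorded before deleting the repo (the step size of 100 is arbitrary):

depth=1
# deepen until the previously processed tip is present again;
# this assumes the branch wasn't force-pushed past old_tip
until git cat-file -e "$old_tip^{commit}" 2>/dev/null; do
    depth=$((depth + 100))
    git fetch --force --depth="$depth" origin "$branch"
done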