I've started using datalad, a wrapper for git annex, to version control data and experiments in my lab. It works great, except that the .git folder can silently grow enormous, especially when going back and forth in git history to repeat certain steps. For example, sometimes I make a commit, realize I need to fix something, roll it back with git reset HEAD~, and then make additional commits from there. This orphans the commit that was formerly the HEAD, so it doesn't appear in git log, but all its associated files are still in the annex, and if you have the commit sha you can still git show it. How can I delete these orphaned commits permanently so that they and their associated files aren't taking up disk space? I tried git gc --prune=now --aggressive, but that seemingly did nothing.
For example:
datalad create test
cd test
# create new branch
git branch tmp
git checkout tmp
# build up a git history to play with
echo a > f
datalad save -m a
datalad run -i . -o . bash -c "echo aa > f"
datalad run -i . -o . bash -c "echo aaa > f"
# cat all annexed files (where symlinks point)
find .git/annex/objects -type f | xargs -I{} cat {}
# prints out:
# a
# aaa
# aa
# remove last 2 commits
git reset --hard HEAD~2
# make another commit from 2 commits ago
datalad run -i . -o . bash -c "echo b > f"
# print out git annex'd files again
find .git/annex/objects -type f | xargs -I{} cat {}
# should print
# a
# aaa
# b
# aa
# everything is still there, despite the git reset --hard
git checkout master
git branch -D tmp
git gc --prune=now --aggressive
# check what's there again
find .git/annex/objects -type f | xargs -I{} cat {}
# everything is still in the annex, even after deleting the branch and running git gc!
Solved on neurostars: https://neurostars.org/t/how-to-permanently-delete-a-commit-in-git-annex/18235
git gc would take care of removing the commits from .git/objects, but annexed files under .git/annex/objects would indeed persist. For annexed files, you can use git annex unused to find annexed files which are no longer used in the refs you specify (so you could, e.g., drop data for intermediate steps between tagged "releases") and then use git annex drop --unused. Note that the git-annex branch would still keep that in its history. So if you are going to do this thousands of times, it might not be a complete solution, and you may want to complement it with git annex forget to forget the history of the annex entirely.
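For anyone landing here later, this is roughly how the steps above could be put together. Treat it as a sketch, not a tested recipe for every setup. The two function names are made up for this example. The reflog expiry step is my own addition and not part of the neurostars answer: git gc keeps commits that the reflog still references, which is one plausible reason the gc in the question appeared to do nothing on the git side. The --force flag is also an assumption about the situation in the question, where no other remote holds a copy of the content; without it, git-annex normally refuses to drop data it cannot verify exists elsewhere.

```shell
# Hypothetical helper: make orphaned commits unreachable from the reflog,
# then let git gc delete them from .git/objects.
prune_orphaned_commits() {
    # Added step (not in the original answer): expire every reflog entry
    # so that reset-away commits are no longer kept alive by the reflog.
    git reflog expire --expire=now --all
    git gc --prune=now --quiet
}

# Hypothetical helper: drop annexed content that no branch or tag uses.
drop_unused_annex_content() {
    # Lists (and records) annexed objects not used by any branch or tag.
    git annex unused
    # Drops what the previous command found. DESTRUCTIVE: --force skips
    # the check that another copy of the content exists somewhere.
    git annex drop --unused --force
}

# Intended usage, from inside the dataset, after deleting the branch:
#   prune_orphaned_commits
#   drop_unused_annex_content
#   find .git/annex/objects -type f    # should now list fewer files
```

If you would rather review before deleting, run git annex unused on its own first and check the list it prints. As the answer notes, the git-annex branch still records that those keys once existed; git annex forget is the tool for trimming that history.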