java git github bfg-repo-cleaner

How to use BFG Repo-Cleaner


I've been advised to use the BFG Repo-Cleaner because the local repo I want to push contains files that are too large for GitHub. I accidentally committed these files (each above about 50MB) a while back, and I don't mind if they get deleted.

On the online instructions: https://rtyley.github.io/bfg-repo-cleaner/

The instructions suggest that I clone a fresh copy of my repo using the --mirror flag (this seems to refer to the online version, not my local one), then run the java -jar bfg.jar ... command, then cd into that local mirror copy of the online repo and push the changes back.

I don't quite understand how this applies to local copies. For a local copy that is too big to push, should I do, for example:

git clone --mirror /Users/me/myrepo

java -jar bfg.jar --strip-blobs-bigger-than 100M /Users/me/myrepomirror.git

I also don't understand how the next steps:

cd /Users/me/myrepomirror.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

would address anything to do with my non-mirrored local repo:

/Users/me/myrepo

I am not sure whether they imply that I should then run this afterwards:

java -jar bfg.jar --strip-blobs-bigger-than 50M my-repo.git

And again I do not know how this addresses the actual repo (not a mirror or an online version) that I want to prune so that I can push it.

Perhaps I am being a bit dull? The documentation doesn't seem very explicit/extensive for something so potentially useful. Any help here would be great. Thanks!


Solution

  • I've never used BFG before, but it sounds useful if you have large files that you need to remove. I'll try to explain the overall process as I understand it.

    Before we begin, note that BFG rewrites the repository's history, and pushing the result will require everyone on your team to re-clone the repository and transfer their local-only branches over.

    According to git's documentation, git clone --mirror will:

    Set up a mirror of the source repository. This implies --bare. Compared to --bare, --mirror not only maps local branches of the source to local branches of the target, it maps all refs (including remote-tracking branches, notes etc.) and sets up a refspec configuration such that all these refs are overwritten by a git remote update in the target repository.

    This means that the clone will create an exact copy of the remote repository on your machine. As the BFG docs say, you should create a backup of this clone in case you need it later.
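
    To see what --mirror produces, here is a minimal sketch that mirror-clones a throwaway repository and then takes the backup copy. The temporary directory and the names source, source-mirror.git and source-mirror.backup.git are made up for the demo; in practice the first argument to git clone --mirror would be your real remote URL.

```shell
set -e
work=$(mktemp -d)

# A throwaway repository with two branches stands in for the real remote.
git init -q "$work/source"
git -C "$work/source" -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git -C "$work/source" branch feature

# Mirror-clone it. The result is bare and carries every ref of the source.
git clone -q --mirror "$work/source" "$work/source-mirror.git"

is_bare=$(git -C "$work/source-mirror.git" rev-parse --is-bare-repository)
mirror_flag=$(git -C "$work/source-mirror.git" config --get remote.origin.mirror)
branch_count=$(git -C "$work/source-mirror.git" for-each-ref refs/heads | wc -l | tr -d ' ')

# Keep an untouched copy before any history rewriting.
cp -R "$work/source-mirror.git" "$work/source-mirror.backup.git"

echo "bare=$is_bare mirror=$mirror_flag branches=$branch_count"
```

    The remote.origin.mirror setting that the clone records is what later lets a plain git push send every ref back.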

    java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git
    

    This targets the clone you made with git clone --mirror and removes files larger than 100M from every commit except the most recent one (as mentioned in the BFG docs). BFG won't delete the old data automatically. It will stop, let you confirm everything looks good and then leave you to clean up the rest.
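
    Before running BFG, it can help to preview which blobs in the history are over the threshold. The sketch below uses plain git rather than BFG, so it only approximates what --strip-blobs-bigger-than would select. The helper name large_blobs is hypothetical, and the limit is given in bytes.

```shell
# List blobs anywhere in history that exceed a size limit, largest first.
# Usage: large_blobs <repo-path> <limit-in-bytes>
# (large_blobs is a hypothetical helper name, not a git or BFG command.)
large_blobs() {
    git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v limit="$2" '
        $1 == "blob" && $2 > limit {
            size = $2
            $1 = ""; $2 = ""          # drop type and size, keep the path
            sub(/^ +/, "")
            print size, $0
        }' |
    sort -rn
}
```

    For example, large_blobs /Users/me/myrepomirror.git 52428800 would print the size and path of every blob over 50 MiB, including files that were deleted in later commits.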

    cd /Users/me/myrepomirror.git 
    

    This navigates to the bare repository. You may have to change the path accordingly.

    git reflog expire --expire=now --all && git gc --prune=now --aggressive
    

    Let's break this command up into its two logical parts:

    1. git reflog expire --expire=now --all
      • The expire subcommand prunes older reflog entries. The reflog records each update to the tips of branches and other refs, including HEAD. --expire=now tells git to expire every entry older than the current time, which in practice means all of them.
      • --all processes the reflogs of all references. Without --all, you would have to name each ref whose reflog you want expired.
    2. git gc --prune=now --aggressive
      • git gc handles garbage collection for git. Normally, it runs automatically on its own, but it is sometimes useful to run it manually.
      • --prune=now tells git gc to remove unreachable loose objects regardless of their age, instead of keeping those newer than the default grace period of two weeks.
      • --aggressive will cause git gc to spend more time cleaning the repository of unnecessary files and provide greater optimization. The git gc docs have some additional info on it.
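
    The effect of the two commands can be seen on a throwaway repository. This is a sketch that does not involve BFG: deleting a side branch stands in for the history rewrite, and the file names and the scratch branch are made up for the demo.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name Demo
git -C "$repo" config user.email demo@example.com

echo "keep me" > "$repo/keep.txt"
git -C "$repo" add keep.txt
git -C "$repo" commit -q -m "keep"

# Commit a file on a side branch, then delete the branch so that no ref
# points at that commit any more.
git -C "$repo" checkout -q -b scratch
echo "pretend this is huge" > "$repo/big.bin"
git -C "$repo" add big.bin
git -C "$repo" commit -q -m "add big file"
blob=$(git -C "$repo" rev-parse HEAD:big.bin)
git -C "$repo" checkout -q -
git -C "$repo" branch -q -D scratch

# HEAD's reflog still remembers the commit, so the blob is still stored.
if git -C "$repo" cat-file -e "$blob" 2>/dev/null; then before=present; else before=pruned; fi

# Drop every reflog entry, then collect the now-unreachable objects.
git -C "$repo" reflog expire --expire=now --all
git -C "$repo" gc -q --prune=now --aggressive

if git -C "$repo" cat-file -e "$blob" 2>/dev/null; then after=present; else after=pruned; fi
echo "before: $before, after: $after"
```

    The blob survives the branch deletion because the reflog still references its commit. It is only physically removed once the reflog entries are expired and git gc prunes unreachable objects.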

    Once all of that is done, git push will overwrite the remote version of all of the branches with the newly cleaned ones.

    You would now have to re-clone the repository in a different directory with git clone to obtain a non-bare version.
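
    The push and re-clone steps can be sketched against a local bare repository that stands in for GitHub. This is an illustration under stated assumptions: the history rewrite is simulated with git commit-tree and git update-ref instead of BFG, and seed, remote.git, mirror.git and fresh are made-up names.

```shell
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a bare repository with two commits to play the role of the remote.
git init -q "$work/seed"
echo "v1" > "$work/seed/file.txt"
git -C "$work/seed" add file.txt
git -C "$work/seed" commit -q -m "first"
echo "v2" > "$work/seed/file.txt"
git -C "$work/seed" commit -q -am "second"
git clone -q --bare "$work/seed" "$work/remote.git"

# Mirror-clone the remote, as in the first step of the process.
git clone -q --mirror "$work/remote.git" "$work/mirror.git"

# Simulate a rewrite: replace the branch tip with a parentless commit
# that has the same tree (BFG would do the real rewriting here).
branch=$(git -C "$work/mirror.git" symbolic-ref HEAD)
tree=$(git -C "$work/mirror.git" rev-parse "HEAD^{tree}")
rewritten=$(git -C "$work/mirror.git" commit-tree "$tree" -m "rewritten history")
git -C "$work/mirror.git" update-ref "$branch" "$rewritten"

# In a mirror clone, a plain push overwrites the remote refs.
git -C "$work/mirror.git" push -q
remote_tip=$(git -C "$work/remote.git" rev-parse "$branch")

# A normal clone of the updated remote gives a non-bare working copy.
git clone -q "$work/remote.git" "$work/fresh"
fresh_is_bare=$(git -C "$work/fresh" rev-parse --is-bare-repository)
fresh_commits=$(git -C "$work/fresh" rev-list --count HEAD)
echo "remote updated: $([ "$remote_tip" = "$rewritten" ] && echo yes || echo no)"
```

    No --force flag is needed here because the mirror clone's configuration makes git push behave as a mirror push, which force-updates the remote refs.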

    Essentially, this process creates a copy of the remote repository, removes the offending files (rewriting the commit history along the way), pushes the rewritten history over what was on the remote previously, and clones a fresh copy of that repository for us to continue working in.

    Preventative measures

    I'd suggest some preventative measures to avoid having to constantly remove these files. BFG shouldn't be run frequently, since it rewrites the repository's history.

    Unfortunately, .gitignore doesn't support ignoring files larger than a given size. However, there may be some options available to you, regardless.

    1. If all of these large files share a particular file extension or live in a specific directory, add a matching pattern to the .gitignore file to prevent git from tracking them.
    2. Create a pre-commit hook which will prevent files above a certain size from being added. There seems to be a script (I haven't tested it) in response to this SO post.
      • This is a client-side githook, meaning it will need to be distributed to other developers on your team.
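
    As a rough idea of what such a hook could look like, here is a minimal, untested-in-production sketch (not the script from the SO post mentioned above). The names MAX_BYTES, oversized_staged_files and pre_commit_check are hypothetical, and the 50 MiB default is only an example. It assumes file names without newlines or other characters that git would quote.

```shell
#!/bin/sh
# Size limit in bytes; defaults to 50 MiB. MAX_BYTES is an illustrative
# variable name, not something git itself reads.
MAX_BYTES=${MAX_BYTES:-52428800}

# Print every staged (added or modified) path whose staged content is
# larger than the limit given as $1.
oversized_staged_files() {
    limit=$1
    git diff --cached --name-only --diff-filter=AM |
    while IFS= read -r path; do
        # ":path" names the version of the file stored in the index.
        size=$(git cat-file -s ":$path")
        if [ "$size" -gt "$limit" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Return non-zero (which aborts the commit when used as a hook) if any
# staged file is over MAX_BYTES.
pre_commit_check() {
    offenders=$(oversized_staged_files "$MAX_BYTES")
    if [ -n "$offenders" ]; then
        echo "Refusing to commit files larger than $MAX_BYTES bytes:" >&2
        printf '%s\n' "$offenders" >&2
        return 1
    fi
    return 0
}
```

    To use it, save the script as .git/hooks/pre-commit, add a final line that calls pre_commit_check, and make the file executable. A developer can still bypass it with git commit --no-verify, so it is a guard against accidents rather than an enforcement mechanism.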