Tags: git, blob, bfg-repo-cleaner, git-filter-repo

Should I use git-filter-repo on the repo clone or on the self-hosted bare repo?


There's a self-hosted Git repository on a Windows Server (Bonobo-based, if anyone is interested). The repository became bloated because of binary blobs, and I'd like to strip out these large blobs along with their whole history.

I looked at BFG / git filter-branch, bfg-ish, and git filter-repo. I think my question is independent of the tool choice; however, git filter-repo seems to be the most recommended.

The big question: should I execute --strip-blobs-bigger-than 4M on the repository clone (working copy), or should I go straight ahead and manipulate the hosted bare repo that Bonobo manages? If I execute it on the client clone, then how will the changes propagate to Bonobo? These changes will be pretty fundamental; will they even be committable?

I have already backed up everything and done some filter-repo analysis. I added the blobs to .gitignore (although their modifications still show up as changes).
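
A side note on the .gitignore point: .gitignore only keeps untracked files out of the index, so files that are already committed keep showing up as modified. A minimal sketch of untracking such a blob (the path is just a placeholder), in case that is the goal:

    git rm --cached path/to/large.bin    # stop tracking the file but keep it on disk
    git commit -m "Stop tracking large binary blob"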


Solution

  • I ended up operating on the hosted bare repository. It looks like filter-repo is intended to be used on a clean clone of a repository:

    git filter-repo --strip-blobs-bigger-than 4M
    Aborting: Refusing to destructively overwrite repo history since
    this does not look like a fresh clone.
      (expected freshly packed repo)
    Please operate on a fresh clone instead.  If you want to proceed
    anyway, use --force.
    

    So I retried on a clean clone and the command ran, but then I was unsure what to do next. There were no file changes per se to commit or push; only the repository "metadata" (the object database and refs) was modified. Interestingly, the operation also stripped [remote "origin"] and [branch "master"] from .git/config, so I needed to re-establish the remote and the branch tracking.
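
    If I had stayed on the clone route, my understanding is that propagating the rewrite would mean re-adding the remote and force-pushing the rewritten branch. A rough sketch, assuming the server accepts non-fast-forward pushes (the URL is a placeholder):

    git remote add origin https://yourserver/Bonobo.Git.Server/YourRepo.git
    git push --force -u origin master    # -u re-creates the [branch "master"] tracking entry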

    Instead, I decided to just go ahead and modify the hosted bare repo. The tool recognizes that it is not a fresh clone there either:

    warning: no corresponding .pack: ./objects/pack/pack-f8fc2556f0b95c1a66219fe3ad3fe41d6319a985.idx
    Aborting: Refusing to destructively overwrite repo history since
    this does not look like a fresh clone.
      (expected freshly packed repo)
    Please operate on a fresh clone instead.  If you want to proceed
    anyway, use --force.
    

    With --force, the repository metadata size decreased from 1.3 GB to 150 MB, similar to what I got when running it on the clean clone:

    > git filter-repo --force --strip-blobs-bigger-than 4M
    Processed 19965 blob sizes
    Parsed 3536 commits
    New history written in 1.44 seconds; now repacking/cleaning...
    Repacking your repo and cleaning out old unneeded objects
    Enumerating objects: 42458, done.
    Counting objects: 100% (42458/42458), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (12993/12993), done.
    Writing objects: 100% (42458/42458), done.
    Selecting bitmap commits: 3257, done.
    Building bitmaps: 100% (137/137), done.
    Total 42458 (delta 33284), reused 37896 (delta 29067), pack-reused 0
    Removing duplicate objects: 100% (256/256), done.
    Completely finished after 10.20 seconds.
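
    To check the size reduction, one quick way is to look at the packed object size directly in the repository (this also works in a bare repo); this is plain Git, nothing filter-repo specific:

    git count-objects -vH    # size-pack is the total size of the pack files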
    

    Since this is a Windows environment, I started from a clean clone after that and had to re-trust the repository in Visual Studio and all that. So far I have been able to push some changes, and I'll report back if anything turns out not to work.

    It's another story if you are dealing with a repository hosted on GitHub or another Git service; in that case you won't have direct access to the bare repository they manage. I'm not sure exactly what happens then. I guess you can push the rewritten history somehow? Someone should comment.
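
    From what I've read (not tested here), the route for hosted services is the clone-based one: run filter-repo on a fresh clone, re-add the remote, and force-push every rewritten ref. Collaborators then have to re-clone so the old history doesn't get pushed back, and the old objects may linger on the service's side for a while. Roughly:

    git push origin --force --all     # overwrite all branches with the rewritten history
    git push origin --force --tags    # tags point at rewritten commits as well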