I'm leading a project with some students in which we use Jupyter notebooks for explanations and test cases in a private GitHub repository. Initially, we committed all notebooks, including outputs, but this became impractical due to large file sizes. We implemented a git filter using nbstripout to automatically strip outputs from new commits, but old outputs still clutter the history.
I'm planning to use git-filter-repo to strip all Jupyter notebook outputs from history. However, my main concern is how to reconcile the rewritten history with students' local changes without them losing work or facing complex merge conflicts.
Everything I have found so far suggests that this cannot be done without massively disrupting everyone's history (for instance, this answer by @torek).
So... let's assume, for simplicity's sake, that you keep a single branch into which all work is merged, and that users create their own branches from it. Let's say that in your repo that branch is called X. Create a new branch called X-old from it and push it to the remote repo.
Then you run git filter-repo and you get an X branch that is, commit for commit, a one-to-one equivalent of X-old, but without the bad stuff. You force-push this new branch to the remote so that the remote X is now clean of the files you removed from history... Now the real fun begins.
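The maintainer-side steps can be sketched in a throwaway repository. This is only an illustration under assumptions: the directory names, file names and commit messages are made up, and an amended commit stands in for the real git filter-repo run, whose exact options depend on how you strip the outputs. Keep in mind that git filter-repo rewrites every local ref by default, so in a real run X-old has to be kept out of the rewrite (for example by having it only on the remote); check its documentation before relying on this.

```shell
#!/bin/sh
set -eu

# Throwaway bare "remote" and a maintainer clone (hypothetical paths).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/maintainer" 2>/dev/null
cd "$work/maintainer"
git config user.name "Demo"
git config user.email "demo@example.com"

# Shared branch X, with notebook outputs committed.
git checkout -q -b X
echo '{"outputs": "large blob"}' > nb.ipynb
git add nb.ipynb
git commit -q -m "Add notebook"
git push -q origin X

# 1. Preserve the old history under a new name and publish it.
git branch X-old X
git push -q origin X-old

# 2. Stand-in for the git filter-repo run: rewrite X so the notebook
#    has no outputs. Amending only works for this one-commit demo;
#    the real rewrite covers the whole history.
echo '{"outputs": ""}' > nb.ipynb
git commit -q -a --amend --no-edit

# 3. Publish the rewritten X. A force-push is needed because the new X
#    does not descend from the old one.
git push -q --force origin X

# Both branches are now visible on the remote.
git ls-remote origin
```

After this, the remote holds X-old (the original commits) and X (the rewritten ones), which is the state the students' instructions below start from.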
Each of your students will see both branches when they run git fetch, and all they need to do is move the branches they currently hold privately on top of X-old so that their commits sit on top of the new X.
This is done like this:
git rebase origin/X-old some-branch --onto origin/X
Which reads something like this: "git, please, would you be so kind as to take the commits that make up the history of some-branch that are not in origin/X-old and place them on top of origin/X, moving some-branch to the last commit of this work when you are done? Here's 20 if you do that for me."
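To see the rebase in action without touching a real repository, here is a minimal, self-contained sketch. Everything in it is an assumption made for the demo: it uses local branches X and X-old instead of origin/X and origin/X-old, made-up file names, and an orphan branch as a stand-in for the history that git filter-repo would produce.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

# "X-old": history that still contains notebook outputs.
git checkout -q -b X-old
echo '{"cells": [], "outputs": "lots of data"}' > nb.ipynb
git add nb.ipynb
git commit -q -m "Add notebook"

# A student's branch based on the old history.
git checkout -q -b some-branch
echo "student work" > work.txt
git add work.txt
git commit -q -m "Student work"

# "X": stand-in for the rewritten history (same commit, outputs stripped).
git checkout -q --orphan X
git rm -q -rf .
echo '{"cells": []}' > nb.ipynb
git add nb.ipynb
git commit -q -m "Add notebook"

# Take the commits in X-old..some-branch and replay them on top of X.
git rebase -q --onto X X-old some-branch

# some-branch now sits on the clean history.
git log --oneline --graph some-branch
```

The argument order here (--onto first) is the one shown in the git rebase documentation; it does the same thing as the command above.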
And tadaaaaaaaaa, they have a branch on top of the new X and can continue working as usual. If they want to push some-branch to a remote and they had already pushed earlier work to that remote branch back when it was based on X-old, they will need to force-push, because the history of some-branch has been rewritten and git does not, by default, allow pushing to a branch when that happens.
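The force-push step can be tried out in a sandbox as well. The sketch below is an illustration, not part of the original recipe: the paths and names are invented, an amended commit stands in for the rebased branch, and it uses --force-with-lease, a more cautious variant of --force that refuses to overwrite the remote branch if someone else has pushed to it since your last fetch.

```shell
#!/bin/sh
set -eu

# Throwaway bare "remote" and a student clone (hypothetical paths).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/student" 2>/dev/null
cd "$work/student"
git config user.name "Demo"
git config user.email "demo@example.com"

# The student pushed some-branch before the history rewrite.
git checkout -q -b some-branch
echo "v1" > work.txt
git add work.txt
git commit -q -m "Student work"
git push -q origin some-branch

# Stand-in for the rebase: same change, but a different commit.
git commit -q --amend -m "Student work (rebased)"

# A plain push is refused because the remote tip is not an ancestor
# of the rewritten branch.
if git push -q origin some-branch 2>/dev/null; then
    echo "unexpected: plain push succeeded"
else
    echo "plain push rejected"
fi

# Overwrite the remote branch, but only if it still points at the
# commit this clone last saw there.
git push -q --force-with-lease origin some-branch
```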
Corner cases:
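If their own commits contain the files you removed from history, they can use git rebase --interactive to remove them from those commits. There are a lot of answers on how to do that.
If those files are still present on their branch, they can simply git rm those files, as there is no need to track them.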