git jupyter-notebook git-filter-repo

Reconciling Git histories after rewriting history to remove Jupyter Notebook outputs


I'm leading a project with some students, where we use Jupyter notebooks for explanations and test cases in a private GitHub repository. Initially, we committed all notebooks, including outputs, but this became impractical due to large file sizes. We implemented a git filter using nbstripout to automatically strip outputs from new commits, but old outputs still clutter the history.

I'm planning on using git-filter-repo to strip all Jupyter notebook outputs from history. However, my main concern is how to reconcile the rewritten history with students' local changes, without them losing work or facing complex merge conflicts.

What I have found so far suggests that this cannot be done without massively disrupting everyone's history (see, for instance, this answer by @torek).


Solution

  • So... Let's assume, for simplicity's sake, that you keep a single branch into which all work is merged, and that users create their own branches from it. Let's say that, in your repo, that branch is called X. Before rewriting anything, create a new branch called X-old from it and push it to the remote repo.

    So, you run git filter-repo and you get an X branch that is essentially a one-to-one, commit-for-commit equivalent of X-old, except that it doesn't have the bad stuff. You force-push this new branch to the remote so that the remote X is now clean of the outputs you removed from history. Now the real fun begins.
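
    A minimal sketch of the maintainer's side on a throwaway repository, for illustration only. It assumes a local bare repository standing in for GitHub, and it fakes the rewrite with git commit --amend on a single commit instead of running git filter-repo (whose exact invocation depends on how you strip the outputs). The file names, the teacher directory and the work variable are all made up:

    ```shell
    #!/usr/bin/env bash
    set -euo pipefail

    # Throwaway playground: a bare "remote" plus the maintainer's clone (hypothetical names).
    work="$(mktemp -d)"
    git init -q --bare "$work/remote.git"
    git init -q "$work/teacher"
    cd "$work/teacher"
    git config user.name "Teacher"
    git config user.email "teacher@example.com"
    git checkout -q -b X
    git remote add origin "$work/remote.git"

    # Old history: one commit that contains a bulky file standing in for notebook outputs.
    echo "code" > notebook.txt
    echo "bulky output" > outputs.txt
    git add .
    git commit -q -m "Add notebook"
    git push -q origin X

    # 1. Keep the old history reachable under a new name and publish it.
    git branch X-old X
    git push -q origin X-old

    # 2. Stand-in for the real rewrite (NOT git filter-repo): drop the bulky file
    #    and rewrite the commit, so X now has a different hash than X-old.
    git rm -q --cached outputs.txt
    git commit -q --amend --no-edit

    # 3. Replace the remote X with the rewritten one.
    git push -q --force origin X

    # Both branches are now on the remote, pointing at different commits.
    git ls-remote --heads origin
    ```

    In a real repository, step 2 is where git filter-repo would run. Keep in mind that, by default, git filter-repo expects a fresh clone and removes the origin remote once it is done, so you would likely have to add the remote again before the force-push.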

    Each of your students will see both branches after running git fetch. All they need to do is take the branches they currently hold privately on top of X-old and move them so that their commits sit on top of X (the new X).

    This is done like this:

    git rebase --onto origin/X origin/X-old some-branch
    

    Which reads something like this: "git, please, would you be so kind as to take the commits that make up the history of some-branch that are not in origin/X-old and place them on top of origin/X, moving some-branch to the last commit of this work when you are done? Here's 20 if you do that for me."

    And tadaaa, their branch now sits on top of the new X and they can continue working as usual. If they want to push some-branch to a remote, and they had already pushed earlier work to that remote branch back when it was based on X-old, then they will need to force-push, because the history of some-branch has been rewritten and git, by default, rejects a push that is not a fast-forward.
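
    To see the rebase in action without touching a real repo, here is a small sketch on a throwaway repository. It rests on several assumptions: the local branches X and X-old stand in for origin/X and origin/X-old, the rewritten X is faked by rebuilding the history on an orphan branch (again, not git filter-repo), and the file names, the student directory and the work variable are invented for the demo:

    ```shell
    #!/usr/bin/env bash
    set -euo pipefail

    # Throwaway repository (hypothetical names throughout).
    work="$(mktemp -d)"
    git init -q "$work/student"
    cd "$work/student"
    git config user.name "Student"
    git config user.email "student@example.com"
    git checkout -q -b X

    # Old history: two commits, the first one drags in a bulky outputs file.
    echo "code v1" > notebook.txt
    echo "bulky output" > outputs.txt
    git add .
    git commit -q -m "Add notebook"
    echo "code v2" > notebook.txt
    git commit -q -am "Update notebook"
    git branch X-old

    # The student's private work, based on the old history.
    git checkout -q -b some-branch
    echo "student work" > student.txt
    git add student.txt
    git commit -q -m "Student work"

    # Stand-in for the rewritten X: the same two commits, minus outputs.txt.
    git checkout -q --orphan X-new
    git rm -rfq .
    echo "code v1" > notebook.txt
    git add notebook.txt
    git commit -q -m "Add notebook"
    echo "code v2" > notebook.txt
    git commit -q -am "Update notebook"
    git branch -f X X-new

    # The actual move: replay what some-branch has on top of X-old onto the new X.
    git rebase -q --onto X X-old some-branch

    # some-branch now consists of the clean history plus the student's commit.
    git log --format=%s some-branch
    # Student work
    # Update notebook
    # Add notebook
    ```

    If a force-push is needed afterwards, git push --force-with-lease origin some-branch is a slightly safer variant of a plain --force, since it refuses to overwrite the remote branch if someone else has pushed to it in the meantime.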

    Corner cases: