There's an old commit in my local repository which added some files, including one called "unwanted.txt". In subsequent commits, that file has been modified, along with others. Is it possible to completely remove the file "unwanted.txt" from history using interactive git rebase? I know it's possible to achieve this using "git filter-branch", but since I am learning git and I want to understand the full potential of "git rebase -i", I wonder if this command can be used for such an operation.
It's possible in theory, but in practice it's usually much too painful.
The method is the same in both rebase and filter-branch. It may help to realize that an interactive rebase is just git cherry-pick on steroids, as it were, and that git filter-branch is simply an automated, extra-complicated rebase across multiple branches and with merge preservation.
As usual with git, it mostly boils down to manipulating the commit graph, and adding new commits that look like existing commits but with something changed—in this case, the trees attached to those commits. (And as soon as one commit is different, it gets a different SHA-1, which means all subsequent commits must change as well, to list the different SHA-1s that pop into existence as the new graph grows.)
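To make the "one changed commit forces every later commit to change" point concrete, here is a toy model in Python. It is only a sketch: the hypothetical `commit_id` helper hashes a tree, a list of parent IDs, and a message, which mimics the idea behind Git's commit IDs but is not Git's actual object format. The file names and contents are made up for illustration.

```python
import hashlib

def commit_id(tree, parents, message):
    """Toy stand-in for a commit hash: a digest of tree, parent IDs, and message."""
    payload = repr((sorted(tree.items()), parents, message)).encode()
    return hashlib.sha1(payload).hexdigest()

def strip(tree, unwanted="unwanted.txt"):
    """Return a copy of the tree without the unwanted file."""
    return {path: data for path, data in tree.items() if path != unwanted}

# Two commits: A adds the unwanted file, B adds an unrelated file on top.
tree_a = {"README": "hello", "unwanted.txt": "secret"}
tree_b = {**tree_a, "main.c": "int main(void) { return 0; }"}
a = commit_id(tree_a, [], "add files")
b = commit_id(tree_b, [a], "add main.c")

# Copy A without the file: the tree differs, so the ID differs.
a_new = commit_id(strip(tree_a), [], "add files")

# Even if B's tree were left untouched, listing the new parent changes B's ID.
b_same_tree = commit_id(tree_b, [a_new], "add main.c")

# The copy we actually want: B without the file, on top of the new A.
b_new = commit_id(strip(tree_b), [a_new], "add main.c")

print(a != a_new, b != b_same_tree, b != b_new)
```

The middle case is the important one: the parent ID is part of what gets hashed, so a commit cannot keep its ID once any ancestor has been replaced.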
To see how it works, start by drawing the commit graph. You'll need a fairly complete graph, depending on how far back you have to go to stop seeing the unwanted.txt file. But I'll just draw a simple graph, with just one named branch, master:
    I - A - B - C - F   <-- master
          \       /
            D - E
Here I is the initial commit; for simplicity let's say it does not have the unwanted file. Let's say instead that this file was introduced in commit A and modified in C and E.
What we need to do is this:

1. Copy commit I (preserving commit author and committer, date stamps, and so on) while removing the unwanted file, i.e., altering the source tree attached to I if needed. Since I does not contain the file, nothing changes: this just gives us commit I back, so we retain its original SHA-1.
2. Copy commit A while removing the unwanted file. This results in a new, different commit A' because we change A's tree to a new tree that has the file removed. We get a new SHA-1 cryptographic checksum because the new commit is different from the old. So we save an entry in a map that says "old commit A replaced by new commit A'".
3. Copy commit B while removing the unwanted file. This changes the tree (remember, each commit has a complete snapshot of the entire source, so the unwanted file is in the original B). Make a new commit B' that has the altered tree and has commit A' as its parent ID.
4. Copy commit C while removing the unwanted file, resulting in C'.
5. Copy commit D with our changes, resulting in D'. (Note that we cannot copy F until we've copied all its predecessors in the graph, in this case C and E.)
6. Copy commit E with our changes, resulting in E'.
7. Copy commit F with our changes. The new commit F' has C' and E' as its two parents; we find these using the SHA-1 mapping that we've been constructing all along.
8. Move the branch name master to point to commit F', abandoning the original commit F.

This results in a graph that looks like this:
          A - B - C - F   [abandoned]
         /  \       /
        /     D - E
       /
      I - A' - B' - C' - F'   <-- master
            \          /
              D' - E'
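The copying procedure above can be sketched as a small Python simulation of the example graph. This is a toy model under stated assumptions, not how rebase or filter-branch is implemented: `commit_id`, `build`, and `remove_file` are hypothetical helpers, the "commit" is just a tree plus parent IDs plus a message, and the file contents are invented. What it does show is the old-to-new SHA-1 map being filled in parents-first and then used to look up the parents of each copied commit.

```python
import hashlib

def commit_id(tree, parents, message):
    """Toy stand-in for a commit hash: a digest of tree, parent IDs, and message."""
    payload = repr((sorted(tree.items()), parents, message)).encode()
    return hashlib.sha1(payload).hexdigest()

def build(specs):
    """Create commits from (label, parent labels, tree) tuples listed parents-first."""
    ids, store = {}, {}
    for label, parents, tree in specs:
        parent_ids = [ids[p] for p in parents]
        cid = commit_id(tree, parent_ids, label)
        ids[label] = cid
        store[cid] = (tree, parent_ids, label)
    return ids, store

def remove_file(store, order, unwanted):
    """Copy every commit without `unwanted`, recording old ID -> new ID as we go."""
    sha_map, new_store = {}, {}
    for old in order:  # parents always come before their children
        tree, parents, message = store[old]
        new_tree = {p: d for p, d in tree.items() if p != unwanted}
        new_parents = [sha_map[p] for p in parents]  # look up the copied parents
        new = commit_id(new_tree, new_parents, message)
        new_store[new] = (new_tree, new_parents, message)
        sha_map[old] = new
    return sha_map, new_store

# The example graph: the file appears in A and is modified in C and E; F merges C and E.
base = {"README": "v1"}
t_a = {**base, "a.txt": "a", "unwanted.txt": "v1"}
t_b = {**t_a, "b.txt": "b"}
t_c = {**t_b, "c.txt": "c", "unwanted.txt": "v2"}
t_d = {**t_a, "d.txt": "d"}
t_e = {**t_d, "e.txt": "e", "unwanted.txt": "v3"}
t_f = {**t_c, **t_e, "unwanted.txt": "v2+v3"}
ids, store = build([
    ("I", [], base), ("A", ["I"], t_a), ("B", ["A"], t_b), ("C", ["B"], t_c),
    ("D", ["A"], t_d), ("E", ["D"], t_e), ("F", ["C", "E"], t_f),
])

sha_map, new_store = remove_file(store, list(ids.values()), "unwanted.txt")
master = sha_map[ids["F"]]  # move the branch name to the copy of F

for label, old in ids.items():
    status = "kept" if sha_map[old] == old else "replaced"
    print(label, old[:7], "->", sha_map[old][:7], status)
```

In this model only I keeps its ID, because neither its tree nor its (empty) parent list changes; every other commit is replaced, and the copy of F lists the copies of C and E as its parents.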
An interactive rebase with --preserve-merges can handle this particular case. (In current Git, --preserve-merges has been replaced by --rebase-merges.) If there's more than one branch, though, you have to carefully rebase the additional branches with --onto as needed to make use of the new commits, which you have to match up with the old commits, most likely using an SHA-1 map file that you construct manually as you go.
There's an additional wrinkle, which is that git commit by default refuses to make "empty" commits, where "empty" is defined as "has the same tree as the previous commit" (and is not a merge). The filter-branch script handles this automatically for you, mapping multiple old commits to a single new commit if you choose to delete empty commits (a commit that only modifies the unwanted file becomes empty once both it and its parent have lost the unwanted file). An interactive rebase does not handle this very well when preserving merges, so that imposes even more pain.
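The empty-commit case can be illustrated with the same kind of toy model. This is a sketch under assumptions, not the filter-branch implementation: `commit_id` and `rewrite_pruning_empty` are hypothetical helpers, and the linear history below (where B does nothing except modify the unwanted file) is invented for the example. When a copied commit would have the same tree as its single copied parent, it is dropped and the old commit is mapped to that parent instead, so several old commits can end up pointing at one new commit.

```python
import hashlib

def commit_id(tree, parents, message):
    """Toy stand-in for a commit hash: a digest of tree, parent IDs, and message."""
    payload = repr((sorted(tree.items()), parents, message)).encode()
    return hashlib.sha1(payload).hexdigest()

def rewrite_pruning_empty(history, unwanted):
    """history: (label, parent labels, tree) tuples, parents first.

    Returns label -> new commit ID, dropping non-merge commits that become empty.
    """
    new_of, trees = {}, {}
    for label, parents, tree in history:
        new_tree = {p: d for p, d in tree.items() if p != unwanted}
        new_parents = [new_of[p] for p in parents]
        if len(new_parents) == 1 and trees[new_parents[0]] == new_tree:
            # Same tree as the copied parent: the commit is now empty, so reuse the parent.
            new_of[label] = new_parents[0]
            continue
        new = commit_id(new_tree, new_parents, label)
        trees[new] = new_tree
        new_of[label] = new
    return new_of

# Linear history I - A - B - C, where B only modifies the unwanted file.
t_i = {"README": "v1"}
t_a = {**t_i, "a.txt": "a", "unwanted.txt": "v1"}
t_b = {**t_a, "unwanted.txt": "v2"}
t_c = {**t_b, "c.txt": "c"}
history = [("I", [], t_i), ("A", ["I"], t_a), ("B", ["A"], t_b), ("C", ["B"], t_c)]

new_of = rewrite_pruning_empty(history, "unwanted.txt")
print(new_of["A"] == new_of["B"])   # old A and old B share one new commit
print(len(set(new_of.values())))    # number of distinct commits after the rewrite
```

Here four old commits collapse to three new ones, and the copy of C is built directly on the commit that both A and B map to.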
There are some other subtle differences: for instance, when rebase "abandons" a chain of commits, they remain in the reflog for the branch that has been rebased, as well as in the reflog for HEAD. The filter-branch script uses a different method: it copies all the references to a sub-namespace, refs/original/. This all matters when you get to the point of wanting to purge the old, abandoned commits: with rebase, you "expire" the old reflog entries, but with filter-branch, you forcibly remove the originals instead.