
Revert previously staged changes (or: undo changes to .git/index)


When trying to understand the ways to undo various git operations I came up with a scenario where I'm not sure how to deal with it. Disclaimer: I did not have this situation when actually working with git 'in production' but I'd still think it's not only an academic question.

Let's look at the following scenario

What I think is happening under the hood

Every time when staging changes with git add a blob object is created under .git/objects/ and the index file (.git/index) gets updated. If I change and add things multiple times, there will be multiple blobs. The old ones aren't immediately garbage collected.

When running the checkout command from above, the index gets updated immediately (although I would have assumed that the content would only be in my working directory, unstaged). This way the reference is gone and I cannot use things like git checkout-index to restore the previously staged content.

Unless garbage collection kicks in the content is still there technically. But I don't know how I would get it back other than manually trying to find the hash somehow and reading the content with git cat-file. The same would, e.g., be true for running git add multiple times, although wanting back the previously staged changes maybe isn't really a use case here. (Or maybe when popping changes from stash? ...)


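So all of this boils down to these questions:

  • Is there something like git reflog for the index?
  • Is git checkout @ -- file considered to be a dangerous command like git reset --hard where you could potentially lose your work?

And if the answers are "No" / "Yes" (what I assume so far):

  • Are there plumbing commands to manually change/rewrite the index? (see the case above where the objects are still there)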

Bonus: Is there an alternative way to checkout a single file without instantaneously staging it?


Solution

  • Your under-the-hood description is mostly right. The only things that aren't 100% have to do with this part:

    Every time when staging changes with git add a blob object is created under .git/objects/

    Internally, git add hashes the content of the data in the work-tree file, a la git hash-object -w -t blob. This doesn't necessarily create a new object: if the hashed content is already in the repository, it just re-uses the existing object. The existing object might be packed, i.e., in .git/objects/pack, rather than loose as a separate blob.
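
    The reuse of existing objects can be checked directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file names a.txt and b.txt are made up for the demonstration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

printf 'same content\n' > a.txt
printf 'same content\n' > b.txt

# Hash without staging; -w writes the blob into the object database.
hash_manual=$(git hash-object -w -t blob a.txt)

# Stage both files; identical content maps to the same blob ID.
git add a.txt b.txt
hash_a=$(git rev-parse :a.txt)
hash_b=$(git rev-parse :b.txt)

echo "manual: $hash_manual"
echo "a.txt:  $hash_a"
echo "b.txt:  $hash_b"

# Count loose objects: two paths were staged, but only one blob was written.
loose_count=$(git count-objects | cut -d' ' -f1)
echo "loose objects: $loose_count"
```

    All three hashes should match, and the loose-object count stays at one because the second and third writes found the blob already present.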

    Moreover, the content written into a blob object might be arbitrarily different from the content in the work-tree due to a clean filter. It is, more often, CR-LF-line-ending-different from the content in the work-tree due to line-ending settings. Clean filters and end-of-line settings are controlled partly (or mostly, depending on your usage of Git) through your .gitattributes file, and partly (or mostly) through settings in your configuration.
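
    To see the line-ending case in action, here is a small sketch. It assumes a scratch repository and a .gitattributes line of my own choosing (`* text`) that marks every path as text; your real settings may differ.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# Assumption for this demo: treat every path as text, so CRLF is normalized to LF on add.
echo '* text' > .gitattributes

printf 'hello\r\n' > crlf.txt     # 7 bytes in the work-tree
git add crlf.txt 2>/dev/null      # Git may warn about the upcoming CRLF -> LF conversion

worktree_size=$(wc -c < crlf.txt | tr -d ' ')
blob_size=$(git cat-file -s :crlf.txt)

echo "work-tree file: $worktree_size bytes"
echo "staged blob:    $blob_size bytes"
```

    The blob is one byte shorter than the work-tree file because the carriage return was dropped before hashing.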

    In any case what matters is that you get a hash ID for a blob object. The blob object definitely exists somewhere—in the .git/objects directory as a loose object, or in a pack file. Now git add can write into .git/index (or whatever other file GIT_INDEX_FILE indicates): it will store, in the index at staging slot zero, an entry for the given path, using the computed blob-hash and mode 100644 or 100755 depending on whether the work-tree file should be marked executable later.
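
    The index entries themselves can be inspected with git ls-files --stage. A rough sketch, assuming a scratch repository on a file system that honors the executable bit; notes.txt and run.sh are hypothetical names.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

echo 'plain text' > notes.txt
printf '#!/bin/sh\necho hi\n' > run.sh
chmod +x run.sh
git add notes.txt run.sh

# Each output line has the form: <mode> <blob hash> <stage><TAB><path>
git ls-files --stage

mode_notes=$(git ls-files --stage -- notes.txt | cut -d' ' -f1)
mode_run=$(git ls-files --stage -- run.sh | cut -d' ' -f1)
stage_run=$(git ls-files --stage -- run.sh | awk '{ print $3 }')
```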

    If you've lost it, you're mostly out of luck

    [Scenario snipped, but it ends with git checkout HEAD -- path clobbering the index entry (the "$path represents $blobhash with mode $mode" information) and clobbering the work-tree copy of the file in path.]

    Unless garbage collection kicks in the content is still there technically. But I don't know how I would get it back other than manually trying to find the hash somehow and reading the content with git cat-file.

    Indeed, you can't: the hash ID computation is a one-way (trapdoor) function. Git can spill out the content only if you have the hash, and you can compute the hash only if you have the content. That's your Catch-22 situation.

    If—this is a pretty important "if"—the content was unique, so that git add really did create a new blob object, and you've just overwritten the blob reference that was in the index, that blob object is indeed no longer referenced anywhere. On the other hand, if git hash-object -w wound up reusing some existing blob, the blob object is still referenced by whatever referenced it before. So there are now two interesting cases: the blob was unique and is now eligible for garbage collection, or, the blob was not unique and is not.

    Using git fsck --lost-found or git fsck --unreachable or git fsck --dangling (the default), you can have Git walk the entire object database, determine which objects are reachable and which are not, and tell you about some or all of the unreachable ones and/or copy information from or about them into .git/lost-found. If the blob object was unreachable, it will be listed as one of these unreachable or dangling blobs, or have its contents restored into .git/lost-found.

    The drawback here is that there may be dozens or even hundreds of dangling blob objects. Your task has now switched from "guess the hash" (virtually impossible) to "find the needle in the haystack" (not as difficult, but tedious, and you might well find the wrong needle—it's not really a haystack, it's a stack of needles, after all). And, of course, this only works for the "blob was unique" case.
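
    Here is a sketch of that needle hunt for the "blob was unique" case, assuming a scratch repository with a single lost blob. In a real repository the loop would print many candidates and you would have to pick the right one by eye.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

echo 'committed version' > file.txt
git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'initial'

# Stage new, unique content, then clobber the index entry and the work-tree file.
echo 'staged but never committed' > file.txt
git add file.txt
git checkout HEAD -- file.txt

# fsck reports each orphaned blob as "dangling blob <hash>".
candidates=$(git fsck 2>/dev/null | awk '$1 == "dangling" && $2 == "blob" { print $3 }')

for hash in $candidates; do
    echo "== $hash =="
    git cat-file -p "$hash"
done

# In this tiny repository there is exactly one candidate.
recovered=$(git cat-file -p "$candidates")
```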

    Answers to specific questions

    (This, by the way, is where this question isn't really a duplicate of "Can git undo a checkout of unstaged files". But that one is still useful, so see it too.)

    Is there something like git reflog for the index?

    No. You can make your own backup copies: just cp .git/index somewhere. But Git doesn't do this on its own. You might make one just before a git checkout HEAD -- path operation, through some alias or shell-function that you use to do this sort-of-dangerous operation.

    Note that Git is not aware of these backup copies, so git gc won't consider referenced objects protected. To use the backups with plumbing commands like git ls-files, put the path name into GIT_INDEX_FILE for the duration of that command.
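
    One way such a wrapper might look is sketched below. The function name checkout_head_with_backup and the backup location .git/index.backup are my own inventions, and the sketch assumes you run it from the top level of a repository with the default .git/index layout.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
echo 'committed version' > file.txt
git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'initial'
echo 'staged work' > file.txt
git add file.txt

# Hypothetical helper: copy the index aside, then run the clobbering checkout.
checkout_head_with_backup() {
    cp .git/index .git/index.backup
    git checkout HEAD -- "$@"
}
checkout_head_with_backup file.txt

# Point plumbing at the saved index for the duration of one command.
backup="$PWD/.git/index.backup"
GIT_INDEX_FILE="$backup" git ls-files --stage

staged_hash=$(GIT_INDEX_FILE="$backup" git ls-files --stage -- file.txt | cut -d' ' -f2)
git cat-file -p "$staged_hash"     # the blob is still present as long as gc has not pruned it
```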

    Is git checkout @ -- file considered to be a dangerous command like git reset --hard where you could potentially lose your work?

    The answer to that depends on who is doing the considering. I would recommend considering it dangerous myself, since you're asking the question at all. :-)

    Are there plumbing commands to manually change/rewrite the index? (see the case above where the objects are still there)

    Yes: git update-index is the one-entry-at-a-time updater (use --cacheinfo or --stdin to provide raw index-entry data rather than having it duplicate a lot of git add work). Many other commands update the index partially or en-masse as well.
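
    As an illustration of --cacheinfo, this sketch writes a blob from standard input and then creates an index entry for it by hand. The path generated.txt is hypothetical and never exists in the work-tree.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# Write a blob straight into the object database; no work-tree file is involved.
blob=$(printf 'content from stdin\n' | git hash-object -w --stdin)

# Create an index entry pointing at that blob. --add is needed because the path is new.
git update-index --add --cacheinfo 100644,"$blob",generated.txt

git ls-files --stage
entry_hash=$(git rev-parse :generated.txt)
```

    The same call, given the hash of a dangling blob, would put lost content back into the index under whatever path you choose.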

    If you have a process by which you back up the index before a git checkout HEAD -- ... operation, you can read the entries out of the backup index (using GIT_INDEX_FILE=... git ls-files for instance) and then use git update-index, without having GIT_INDEX_FILE set, to put the information into the regular index. Of course, this being an index-overwrite-y operation, you might wish to first make another backup of the index.
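
    A sketch of that round trip, under the assumption that a backup was taken at a made-up location (.git/index.backup) before the checkout. It feeds the git ls-files --stage output to git update-index --index-info, which accepts that format.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
echo 'committed version' > file.txt
git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'initial'
echo 'staged work' > file.txt
git add file.txt

backup="$PWD/.git/index.backup"       # hypothetical backup location
cp .git/index "$backup"
git checkout HEAD -- file.txt         # index and work-tree now hold the HEAD version

# Keep a second backup before overwriting the index again.
cp .git/index "$PWD/.git/index.before-restore"

# GIT_INDEX_FILE applies only to ls-files; update-index writes to the regular index.
GIT_INDEX_FILE="$backup" git ls-files --stage -- file.txt | git update-index --index-info

restored=$(git show :file.txt)
worktree=$(cat file.txt)
echo "index:     $restored"
echo "work-tree: $worktree"
```

    Only the index entry comes back this way. The work-tree file keeps the HEAD version until you extract the staged copy, for example with git checkout-index -f -- file.txt.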

    Is there an alternative way to checkout a single file without instantaneously staging it?

    No, but only because of the verb checkout here. To view the contents of a file that is in either the index, or in any commit—so that the contents have a name that git rev-parse can understand—use git show:

    git show :file          # file in index at stage zero
    git show :3:file        # file in index at stage three, during merge conflict
    git show HEAD:file      # file in current commit
    git show master~7:file  # file in commit 7 first-parent hops back from master
    

    Note also that git reset can overwrite one or many files in the index without touching the files in the work-tree:

    git reset HEAD -- file  # copy HEAD:file to :file leaving work-tree file undisturbed
    

    If you give git reset the path to a directory, it resets all the files that are already in the index and reside within the directory.
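
    The two directions can be compared side by side. This is a sketch under my own assumptions: a scratch repository, and a plain shell redirect of git show output to overwrite the work-tree file, which writes the raw blob and so skips any smudge filter or line-ending conversion. Git 2.23 and later also offer git restore --source=HEAD -- file, which by default writes only the work-tree copy; it is not used here so the sketch runs on older versions too.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
echo 'committed version' > file.txt
git add file.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m 'initial'

# Case 1: reset the index entry only.
echo 'edited' > file.txt
git add file.txt
git reset -q HEAD -- file.txt
index_after_reset=$(git show :file.txt)       # back to the HEAD version
worktree_after_reset=$(cat file.txt)          # still the edited content

# Case 2: overwrite the work-tree file only.
echo 'staged edit' > file.txt
git add file.txt
git show HEAD:file.txt > file.txt
index_after_show=$(git show :file.txt)        # staged edit survives
worktree_after_show=$(cat file.txt)           # HEAD version
```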