Tags: git, merge, meld

During a git merge with meld, why can I modify LOCAL and REMOTE if only MERGED will be saved?


According to one of the answers to this question (https://stackoverflow.com/a/18011273/5238559), the LOCAL, BASE and REMOTE files are not altered in the merge process; only the resulting MERGED file is.

During a merge in meld, I modify the middle panel (BASE) by moving code over from the left (LOCAL) and right (REMOTE) panels. I understood that BASE is a sort of "preview" of what the final merged file will look like, but that it won't be saved directly, which seems like a logical safety step.

However, I can also move code from BASE to LOCAL or REMOTE, and when I close meld, I'm asked to save the changes to all three files. Why can I do this if only BASE (i.e. MERGED) is relevant to the merge process? What happens to the modifications in LOCAL and REMOTE?


Solution

  • TL;DR of the TL;DR

    Git doesn't use your working tree files except when you (or something) run(s) git add. Note that git mergetool runs git add on only one of the files that meld works with. So you can write as many extra files as you like. Git doesn't care. It only cares about that one particular file when meld is done.

    TL;DR

    Presumably you're running this merge tool meld via git mergetool. The way git mergetool works is ridiculously simple, once you understand how merge itself works, and that's why you can modify all these files: because they are all just files.

    For all this to make sense, you need to know how git merge works. This gets us into the distinctions between:

      • commits;
      • Git's index; and
      • your work-tree.

    The third one of these—your work-tree—is the only place that holds files that you can see. But—and this is very important—your work-tree is not in Git at all. It's just a place that Git sticks files into, so that you can see them and work on / with them. Later, git add will copy one of these files back into Git's index. If you use git mergetool to run a merge tool, the git mergetool code runs git add for you.

    The mergetool script runs git add on the merged file (by name) so whatever is in that file is what gets git added. Any remaining files are just junk as far as Git is concerned: they are simply untracked files. I believe mergetool should clean up the junk files (but should does not mean always will and opinions may differ on the should part too; there's a "keep backup" option here, which I have never used).

    Long

    You may be able to skip some sections below, depending on how familiar you are with Git. I will try to keep them short (by leaving a lot out) but they are still going to be long.

    More background on commits

    Each Git commit is given a unique number. These numbers are not simple counting numbers—we don't have commit #1 followed by #2, then #3, and so on. Instead, the numbers are random-looking, big, ugly hash IDs computed by a cryptographic hash function. These numbers are unique across all Git repositories everywhere (which is how Git manages the distributed nature of commits), but all we need to know here is that commits are numbered.

    Each commit holds two things. All parts of the commit are read-only, so these things are unchangeable, and are valid forever—or at least as long as the commit itself continues to exist:

      • a full snapshot of every file, as recorded in Git's index at the time the commit was made; and
      • some metadata, such as who made the commit and when, plus the hash ID(s) of the commit's parent commit(s).

    A merge commit is simply a commit that has at least two parent hash IDs in it. The git merge command often makes such a commit at the end: the first parent is the same parent that any ordinary non-merge commit would have, and the second parent is the hash ID of the commit that you just merged (e.g., the tip commit of a branch you merged by branch-name). The snapshot part of a merge is the same as any commit: it's just a full copy of every file as recorded in Git's index at the time the merge is completed.
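
To see the two parent hash IDs for yourself, here is a minimal sketch that builds a throwaway repository, makes a non-conflicting merge, and counts the parents recorded in the resulting commit. The branch names, file names, and the demo identity are all made up for illustration.

```shell
set -e
# Throwaway identity so that git commit works without any global config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1   # name the initial branch

printf 'shared\n' > base.txt
git add base.txt
git commit -q -m 'H'
git branch branch2

printf 'ours\n' > a.txt
git add a.txt
git commit -q -m 'J'

git checkout -q branch2
printf 'theirs\n' > b.txt
git add b.txt
git commit -q -m 'L'

git checkout -q branch1
git merge -q --no-edit branch2

# rev-list --parents prints: <commit> <parent1> <parent2> ...
set -- $(git rev-list --parents -n 1 HEAD)
parent_count=$(( $# - 1 ))
second_parent=$3
echo "parents: $parent_count"
```

The second parent should be the tip of branch2, i.e. the commit that was just merged.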

    Git's index, and how it expands during merges

    Git's index has three names: Git calls it the index (as I am doing here), the staging area (for normal commits at least), and—rarely these days, mostly in flags like --cached—the cache. For normal, non-merge commits, I like to describe the index as holding your proposed next commit.

    What's in the index is, normally, a list of entries, each one a tuple of name, mode, and hash ID.

    Again, that's in the normal, non-merge case. These entries have a stage number (because the index is the "staging area") that is always zero. This is what makes them normal.
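
You can look at these entries directly with git ls-files --stage. The following is a small sketch in a scratch repository (the file name is made up); it only shows the normal, stage-zero case.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'hello\n' > file.txt
git add file.txt            # copies the file into the index

# Each output line is: <mode> <blob hash ID> <stage number> TAB <name>
git ls-files --stage

mode=$(git ls-files --stage file.txt | awk '{print $1}')
stage=$(git ls-files --stage file.txt | awk '{print $3}')
echo "mode: $mode, stage number: $stage"
```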

    When git merge starts, it expands the index. It replaces all the stage-zero entries, which represent the current commit (the index needs to match the current commit at the start of the merge operation), with stage 2 entries. This also opens up spaces for stage 1 and stage 3 entries. We'll come back to this below.

    Your working tree

    Both committed files—which are stored via blob hash IDs—and the index, which literally stores these same kinds of blob hash IDs, store the internal format versions of Git files, in which contents are compressed and de-duplicated, and maybe even delta-encoded. This format is suitable for archiving (because it's compressed and de-duplicated) but not for getting any actual work done. So Git has to extract such a file, from a commit or from Git's index, expanding out any compression.

    The result of extracting an archived blob object goes into an ordinary file. These files need to live somewhere, and that somewhere is your working tree. So git checkout or git switch works by copying files out, from a commit into Git's index—this part is fast and cheap as the index holds the files in the same format as the commit—and then to your working tree.

    The copying out to your working tree is slowish, but Git gets to cheat. Because the index keeps track of what's in your working tree, Git can usually tell very quickly if the working tree file is untouched from the last checkout. It can also tell, just by checking hash IDs, whether the file in the new commit you're checking out now is the same as the file in the old commit you had checked out before. If all goes well—and usually it does—Git can just leave the file alone, so it does.

    In principle, then, a git checkout of a different commit has to remove every old file (from Git's index and your working tree) and then fill in every new file from the new commit. Git just skips a lot of this work, which means a multi-megabyte or gigabyte checkout can take very little time (sometimes just a few milliseconds but this depends strongly on OS, caches, and other details, and also on the switch from commit X to commit Y not needing to change a lot of working tree files).

    Other than this, though, your working tree is just a regular old set of files and directories / folders (whichever term you prefer). Everything that works on your computer, works here. Aside from writing into it when you tell it—e.g., with git checkout—Git just lets you play with it to your heart's content. Then you can run git status, which only looks at it, or git add, which copies from it into Git's index. Until you do either of these, though, Git is completely hands-off.

    In short, your working tree is yours, to do with as you will. You can create files here that Git never needs to know about. As long as (a) you don't git add them and (b) they never come out of some existing commit, they never get into Git's index, and Git never knows about them. The git status command will whine about them, and you will need to list such files in .gitignore to make Git shut the bleep up, but other than that, they're quite irrelevant.
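
A quick way to convince yourself of this is to drop an extra file into a scratch working tree and compare what the index knows about with what is merely lying around. This is only a sketch; the file names and demo identity are invented.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'tracked\n' > tracked.txt
git add tracked.txt
git commit -q -m 'initial'

printf 'scratch\n' > notes.txt      # created, but never git add-ed

in_index=$(git ls-files)                                # what Git's index holds
untracked=$(git ls-files --others --exclude-standard)   # what Git merely sees
echo "in index:  $in_index"
echo "untracked: $untracked"
```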

    Internals of the three-way merge

    When we run git merge, we quite typically are doing a three-way merge, which can have conflicts. To understand what's happening, let's look at a sample commit graph, i.e., a set of commits as found in some Git repository. Because the hash IDs of real commits are incomprehensible, we'll use single uppercase letters to stand in for them, like this:

              I--J   <-- branch1 (HEAD)
             /
    ...--G--H
             \
              K--L   <-- branch2
    

    I've added two branch names, branch1—which we currently have checked out, i.e., we're using commit J to fill Git's index and our working tree—and branch2, which selects commit L. The (HEAD) notation shows that we have branch1 checked out. All six listed commits are ordinary single-parent commits, so that as viewed from commit J—i.e., git log if we were to run it right now—we see, as history, commit J first, then commit I, then commit H, then commit G, and so on. As viewed from commit L—if we run git log branch2—we see commit L first, then K, then H, then G, and so on as before.

    These two commit histories meet up, when we go backwards like this, at commit H. So commit H is the merge base in this three-way merge.

    The goal of a merge is to combine work. We'd like to have Git figure out, on its own, what we changed since commit H. These are "our changes". We'd like to have Git figure out what they changed since commit H. These are "their changes". Git can in fact do this, using git diff:

    git diff --find-renames <hash-of-H> <hash-of-J>
    

    This will produce a list of each file we changed, and what lines need to be deleted and added to each of those files to turn the copies of those files that exist in commit H into the copies of those same files that exist in J.

    Similarly:

    git diff --find-renames <hash-of-H> <hash-of-L>
    

    will produce a list of files they changed, and what lines need to be modified in those files.
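
As a rough illustration, the sketch below builds a tiny version of the graph above (just H, J, and L, with invented file names and a demo identity), asks Git for the merge base, and then runs the two diffs against it. It uses --name-only to keep the output short.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'shared\n' > base.txt
git add base.txt
git commit -q -m 'H'
h_id=$(git rev-parse HEAD)
git branch branch2

printf 'ours\n' > a.txt
git add a.txt
git commit -q -m 'J'

git checkout -q branch2
printf 'theirs\n' > b.txt
git add b.txt
git commit -q -m 'L'
git checkout -q branch1

# Where do the two histories meet?
base=$(git merge-base branch1 branch2)

# What did each side change since the merge base?
ours_changed=$(git diff --find-renames --name-only "$base" branch1)
theirs_changed=$(git diff --find-renames --name-only "$base" branch2)
echo "we changed:   $ours_changed"
echo "they changed: $theirs_changed"
```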

    If Git simply (simply?) combines these two lists and applies both sets of changes to the files taken from commit H, Git will arrive at a set of files that keeps our changes (H-to-J) and adds their changes (H-to-L). In many cases, some file we changed will have no changes on their side, and vice versa. These will be easy for Git. In some cases, some files will have changes on both sides. If those changes touch different lines, Git may be able to combine those changes on its own.

    These are the rules that Git uses, anyway. It just:

    The index now has three copies of each file, from merge base commit (BASE), --ours commit (LOCAL), and theirs (REMOTE). Each of these is really just a hash ID, for an internal Git blob object (well, plus the name and mode, with the staging number representing the slot).[1]

    Because of the de-duplication trick, if nobody made any changes to the file, all three staging slots will hold the same hash ID (and mode) and Git can just collapse all three index entries back down to a single slot-zero entry. If we changed the file, but they didn't, the base and their slot will have the same hash ID (and mode) and ours will differ and Git will just take our version of the file, moving slot 2 to slot zero and erasing slots 1 and 3. If they changed the file and we didn't, the base and our slot will have the same hash ID and theirs will differ and Git will just take their version of the file, moving slot 3 to slot zero, etc.
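
The collapse itself happens inside Git, but its result is easy to observe by comparing blob hash IDs after a merge. In this sketch (file names and identity are made up), we change only a.txt, they change only b.txt, and nobody touches c.txt, so every file should end up at stage zero with the hash ID from whichever side changed it.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'a v1\n' > a.txt
printf 'b v1\n' > b.txt
printf 'c v1\n' > c.txt
git add a.txt b.txt c.txt
git commit -q -m 'H'
git branch branch2

printf 'a v2 from ours\n' > a.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'b v2 from theirs\n' > b.txt
git commit -q -am 'L'

git checkout -q branch1
git merge -q --no-edit branch2

# After the merge, HEAD^1 is "ours" and branch2 is "theirs".
merged_a=$(git rev-parse HEAD:a.txt); ours_a=$(git rev-parse HEAD^1:a.txt)
merged_b=$(git rev-parse HEAD:b.txt); theirs_b=$(git rev-parse branch2:b.txt)

# Every index entry is back at stage zero.
stages=$(git ls-files --stage | awk '{print $3}' | sort -u)
echo "stages in index: $stages"
```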

    This means we only ever have to work hard for files where both sides made changes (well, or for high-level / tree conflicts, which I'll skip over here). In this case, the various merge strategies that Git has today work by:

    The built-in low-level merge driver works on a line-by-line basis, using git diff on the individual files.[2] For each diff-hunk you'd see in git diff output, it looks to see if the other side has touched the same lines, or lines that "touch" another change (e.g., if "our" diff adds a line at the end and "their" diff also adds a line at the end, Git has no idea which order to use when adding both sets of lines).[3] It writes, to our working tree copy of the file in question, Git's best guess at the correct merge. If this all goes well (that is, if Git is able to combine the two sets of changes without conflicts), Git then does an internal git add on the file. If not, Git leaves the conflicts in the working tree copy of the file, complete with conflict markers, and doesn't do an internal git add on the file.

    When the low level driver encounters something that is considered a conflict, if there is an extended-argument -X ours or -X theirs in effect, it will just take our change (from 1-vs-2) or their change (1-vs-3) according to the -X value, and not put in any conflict markers. So low-level conflicts can be resolved automatically in software using these flags. Note, though, that Git doesn't do anything smart here. It just picks the 1-vs-2 file diff, or the 1-vs-3 file diff, on the basis of a line-by-line diff hunk. But this does let Git run an internal git add on its own.
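
Here is a small sketch of -X ours in action, using a one-line file that both sides changed (names and identity are invented). Without the option this merge would stop with a conflict; with it, Git keeps our side of the conflicting hunk and finishes on its own.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'base line\n' > file.txt
git add file.txt
git commit -q -m 'H'
git branch branch2

printf 'our change\n' > file.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'their change\n' > file.txt
git commit -q -am 'L'

git checkout -q branch1
git merge -q --no-edit -X ours branch2

result=$(cat file.txt)
unmerged=$(git ls-files -u)     # empty when no conflicts remain
echo "merged file contains: $result"
```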

    When Git does run an internal git add, this simply takes the working tree copy of the file and copies it into slot zero, erasing slots 1 through 3 for that file. That marks the file as resolved. The index shrinks back to normal, for that one set of file entries. After all files have been processed, either there are some conflicts still showing in Git's index (because some file didn't get pre-collapsed and did not get git add-ed), or there aren't (all files got an easy index collapse, or got git add-ed after the low level driver did its thing).
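
To watch the index expand and shrink, the following sketch forces a conflict in a scratch repository (names and identity are made up), lists the stage numbers for the conflicted file, and then runs git add by hand to collapse them.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'base line\n' > file.txt
git add file.txt
git commit -q -m 'H'
git branch branch2

printf 'our change\n' > file.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'their change\n' > file.txt
git commit -q -am 'L'

git checkout -q branch1
git merge branch2 >/dev/null 2>&1 || true    # stops with a conflict

# Three entries for one file: stage 1 (base), 2 (ours), 3 (theirs).
git ls-files --stage file.txt
stages_before=$(git ls-files --stage file.txt | awk '{print $3}' | xargs)
markers=$(grep -c '^<<<<<<< ' file.txt)

# Resolve by hand, then git add: slots 1-3 are erased, slot zero is filled.
printf 'our change\ntheir change\n' > file.txt
git add file.txt
stages_after=$(git ls-files --stage file.txt | awk '{print $3}' | xargs)
echo "before: $stages_before / after: $stages_after"
```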


    [1] The design here was supposed to allow more than one slot-1 entry when doing recursive merge, but that never went anywhere. It's not clear if it could go anywhere as there are some very tricky corner cases with files that don't exist in one or two of the three commits, and they get trickier if you allow this kind of thing.

    [2] There is, in the existing merge-recursive algorithm, a bunch of redundant work in both the high and low level code. The ongoing work to put in a new improved merge is eliminating a lot of this and will speed up a lot of the more difficult merges. This doesn't change the goal of the merge code, nor the high level description I'm giving here, but does shuffle the point at which some bits of work are done and results saved, or not saved, so that they can be done once instead of repeatedly.

    [3] A low level union merge, which Git doesn't support directly (but which you can get with git merge-file, used as a low-level merge driver that you write), assumes that line order is irrelevant, and can handle this without calling it a conflict.
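
For the curious, git merge-file has a --union option that does this on three plain files, no repository required. The sketch below uses invented file names; both sides append a different line after the same base line, which would normally be a conflict.

```shell
set -e
dir=$(mktemp -d)
cd "$dir"

printf 'a\n'         > base.txt
printf 'a\nours\n'   > ours.txt
printf 'a\ntheirs\n' > theirs.txt

# Argument order is: <current> <base> <other>; -p writes the result to stdout.
git merge-file -p --union ours.txt base.txt theirs.txt > merged.txt

merged=$(cat merged.txt)
conflict_markers=$(grep -c '^<<<<<<< ' merged.txt || true)
cat merged.txt
```

The merged file keeps both added lines, ours first and then theirs, with no conflict markers.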


    The upshot of all of this

    The description of what merge does with Git's index is pretty long, but if you've followed the logic all the way through, you will see that:

    So merge conflicts remain if and only if there are any nonzero stage numbers in Git's index. In this case, git merge stops, leaving behind a bunch of internal files—such as .git/MERGE_HEAD and .git/MERGE_MSG—to record that there's an ongoing merge. Meanwhile the index itself has some nonzero slot numbers, which record that there is a conflict.

    If the conflict was a low-level conflict, and we used Git's built in low-level merge driver on some file, the working tree copy of that file has conflict markers in it. These markers are derived from running the three original input files through the same code that git merge-file has available (so you could reconstruct the merge conflicts that way, but there's an easier way with git checkout -m or git restore -m at this point). Regardless of what's in the working tree copy of the file, the three input files exist in the index.
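
The git checkout -m trick can be sketched like this: because the three inputs are still in the index, Git can rebuild the conflicted working tree copy even after you have overwritten it. The repository, names, and identity below are made up for the demonstration.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'base line\n' > file.txt
git add file.txt
git commit -q -m 'H'
git branch branch2

printf 'our change\n' > file.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'their change\n' > file.txt
git commit -q -am 'L'

git checkout -q branch1
git merge branch2 >/dev/null 2>&1 || true    # stops with a conflict

# Wreck the working tree copy; the index still has stages 1, 2 and 3.
printf 'oops, I mangled the attempted resolution\n' > file.txt

# Re-create the conflicted file from the three index entries.
git checkout -m -- file.txt
markers=$(grep -c '^<<<<<<< ' file.txt)
cat file.txt
```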

    If we now run git mergetool, this code rummages through the index (using git ls-files --stage or equivalent) to find the conflicted files. It then uses git checkout-index to extract the three files that were the inputs to the low-level merge driver. These get funky .gittemporary style names, which git mergetool renames to file_BASE, file_LOCAL, and file_REMOTE respectively (well, the exact naming pattern is tricky and this is just an approximation). For internal purposes, it copies the file to file_BACKUP. Then it runs your selected merge tool on these files (excluding the backup one).

    Your merge tool now works with working tree files. None of these files are in Git. You do whatever you like to them, using your merge tool. Whatever is in file, git mergetool assumes that's the result that you produced through use of the merge tool.
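
The sketch below imitates that extraction step by hand, to show why editing LOCAL or REMOTE has no effect on the merge. It uses git show on the three index stages rather than the git checkout-index call that git mergetool itself makes, and the file_BASE.txt style names are only an approximation of the real naming pattern; the repository and identity are made up.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'base line\n' > file.txt
git add file.txt
git commit -q -m 'H'
git branch branch2

printf 'our change\n' > file.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'their change\n' > file.txt
git commit -q -am 'L'

git checkout -q branch1
git merge branch2 >/dev/null 2>&1 || true    # stops with a conflict

# Copy the three index stages out into ordinary working tree files.
git show :1:file.txt > file_BASE.txt
git show :2:file.txt > file_LOCAL.txt
git show :3:file.txt > file_REMOTE.txt

# Scribble on LOCAL and REMOTE, as a merge tool would let you do.
printf 'edited in the merge tool\n' >> file_LOCAL.txt
printf 'edited in the merge tool\n' >> file_REMOTE.txt

# The index entries (and the commits behind them) are untouched ...
index_ours=$(git show :2:file.txt)
index_theirs=$(git show :3:file.txt)
# ... and the extracted copies are merely untracked files.
untracked=$(git ls-files --others --exclude-standard | xargs)
echo "untracked: $untracked"
```

Only whatever ends up in file.txt, and then gets git add-ed, becomes part of the merge result.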

    Here, there's one more special trick:

    When git merge stops in the middle, your job is to clean up the mess, by writing into Git's index, at slot zero, the correct merge result. You can do this any way you like. My preferred method is generally just to open file in vim, after Git writes it with merge.conflictStyle set to diff3. I find most conflicts easy to resolve this way. In a few cases, I really do want to get the three versions, and for those cases, git mergetool is a way to do it—but having played with git mergetool, I haven't found it a particularly good way to do it. This is one of those user-preference deals, though.

    Anyway, once you have all the conflicts resolved, and have run git add to update Git's index, you should run:

    git merge --continue
    

    to tell Git to finish the merge. Git does not care how you resolved the conflicts. Git just cares that you put the right file into the index at staging slot zero, clearing out the other three staging slots.
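
Putting the last steps together, here is a sketch of the whole resolve, add, continue sequence in a scratch repository. The names and identity are invented, and GIT_EDITOR=true is set only so that the commit message editor returns immediately and the sketch runs unattended.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch1

printf 'base line\n' > file.txt
git add file.txt
git commit -q -m 'H'
git branch branch2

printf 'our change\n' > file.txt
git commit -q -am 'J'

git checkout -q branch2
printf 'their change\n' > file.txt
git commit -q -am 'L'

git checkout -q branch1
git merge branch2 >/dev/null 2>&1 || true    # stops with a conflict

# Resolve however you like; Git only sees what you git add.
printf 'our change\ntheir change\n' > file.txt
git add file.txt

# Finish the merge, accepting the prepared commit message as-is.
GIT_EDITOR=true git merge --continue

if [ -e .git/MERGE_HEAD ]; then merge_in_progress=yes; else merge_in_progress=no; fi
committed=$(git show HEAD:file.txt)
set -- $(git rev-list --parents -n 1 HEAD)
parent_count=$(( $# - 1 ))
echo "merge in progress: $merge_in_progress, parents: $parent_count"
```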

    In the bad old days you had to run:

    git commit
    

    to finish the merge, and if you'd gotten confused (e.g., got interrupted, had cd'ed to some other repository, then had a meeting or something, and were now somewhere other than what you were thinking when you ran git commit) you could make an ordinary commit instead of finishing your merge. The --continue checks that there is in fact a merge to finish, then runs git commit to finish it.