
What are the merge semantics of git fast-import streams?


I wrote, and maintain, an open-source tool called reposurgeon that edits version-control repository histories and can be used to move project histories between VCSes. Recently I shipped full support for reading Subversion dump files and repos. But there is one thing reposurgeon doesn't do very well yet, and that is translate Subversion branch merges done by copying to git-style DAG merges.

In order to get this part right, I need to understand the semantics of merge commits in a git fast-import stream much better than I do. My questions are about which version of content is supposed to be visible after a merge commit.

Of course, file modifications attached to the merge commit make their content visible there. My questions are about paths not touched by the commit.

  1. If a path has content on only one commit chain ancestral to the merge, I assume that content is supposed to be visible. Is that correct?

  2. If a path has content in more than one commit chain ancestral to the merge, which version will be visible?

  3. If a file is deleted along some paths to the merge, what rule predicts when it will be deleted in the merge revision?


Solution

  • If I understand your question correctly, you're wondering exactly what shortcuts fast-import lets you take when streaming the contents of a commit into it.

    As far as I can tell from reading git/fast-import.c and the manual page, fast-import initializes the tree for a new commit from the tree of the commit named in the "from" command (or, when "from" is omitted, from the current tip of the branch being committed to). "filemodify" and friends begin from that state to construct the new tree that will be committed at the end.

    The fast-import command does not appear to change the tree at all when encountering "merge" commands; if you want to include changes from parents other than the first, you need to specify exactly which files you want to bring in. You can use marks or object hashes to name the other-branch files for "filemodify" though.
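To make the rule above concrete, here is a toy Python model of it. This is only a sketch of the behaviour as described in this answer, not code taken from git; `build_commit_tree`, the tuple encoding of file commands, and the sample trees are all made up for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

Tree = Dict[str, str]  # path -> file content (a stand-in for git blobs)


def build_commit_tree(
    from_tree: Optional[Tree],
    merge_trees: Iterable[Tree],
    file_ops: Iterable[Tuple[str, ...]],
) -> Tree:
    """Toy model (hypothetical helper) of how a commit's tree is assembled."""
    # Start from a copy of the first parent's tree (the "from" commit).
    tree: Tree = dict(from_tree or {})
    # merge_trees is deliberately never read: in this model, "merge"
    # parents are recorded as parents but contribute nothing to the tree.
    for op in file_ops:
        if op[0] == "M":    # ("M", path, content) models filemodify
            tree[op[1]] = op[2]
        elif op[0] == "D":  # ("D", path) models filedelete
            tree.pop(op[1], None)
        else:
            raise ValueError(f"unsupported op: {op[0]!r}")
    return tree


trunk = {"README": "trunk readme", "shared.c": "trunk version"}
branch = {"shared.c": "branch version", "feature.c": "new feature"}

# A merge with no file commands ends up with exactly the first parent's tree.
bare_merge = build_commit_tree(trunk, [branch], [])

# To make branch content visible, every such path has to be named explicitly.
real_merge = build_commit_tree(
    trunk,
    [branch],
    [("M", "feature.c", branch["feature.c"]),
     ("M", "shared.c", branch["shared.c"])],
)
```

Read against the three questions, and if this model is right: content that exists only on a non-first parent's chain is not visible unless you re-add it; when several chains have a path, the first parent's version is what you get; and a path is absent from the merge when it is absent from the first parent or explicitly deleted in the merge commit.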


    edit: Ah, let's go deeper into the git model.

    In git, a commit points to a tree that represents the entire contents of the directory hierarchy being tracked, as it stood at the time of that commit. Commits do not carry any information about how they're different from their parents; the theory is that you can reconstruct the diff if you need it by comparing these trees.

    A merge commit is distinguished from non-merges only by the fact that it has two or more parents. It still has a single tree, recording exactly what's in the version that resulted from performing the merge. It still does not record anything about how its author combined the parents into a merged version. The git "porcelain" commands like git log and git diff do magic to reconstruct a useful description of what happened.

    Conceptually, to create a new commit object, you need to describe the complete mapping of paths to file contents that goes in that commit. (Much cleverness goes into making that efficient and simple instead of awful.)

    The git fast-import command provides a shortcut for the common case: Usually the VCS you're exporting from can tell you how this commit was formed as some kind of diff from the most recent commit on the same branch. In that case, you can effectively encode the diff into fast-import's stream format for a simpler and faster import.

    But you have to remember that it's only a shortcut for reconstructing the entire tree from scratch.
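For a merge, that suggests working out the merged tree you want and then emitting only its differences from the first parent. A minimal Python sketch of that idea follows. The helper name `merge_commit_stanza`, the in-memory path-to-bytes trees, the fixed 100644 mode, and the sample identity and marks are assumptions for illustration; paths are assumed to need no quoting.

```python
from typing import Dict, List

Tree = Dict[str, bytes]  # path -> file content


def merge_commit_stanza(
    ref: str,
    mark: int,
    committer: str,
    message: str,
    first_parent: str,
    other_parents: List[str],
    first_parent_tree: Tree,
    merged_tree: Tree,
) -> bytes:
    """Emit one fast-import commit whose tree ends up equal to merged_tree."""

    def data(payload: bytes) -> bytes:
        # Exact byte count form; the trailing LF after the payload is optional.
        return b"data %d\n" % len(payload) + payload + b"\n"

    out = [
        f"commit {ref}\n".encode(),
        f"mark :{mark}\n".encode(),
        f"committer {committer}\n".encode(),
        data(message.encode()),
        f"from {first_parent}\n".encode(),
    ]
    out += [f"merge {p}\n".encode() for p in other_parents]
    # Paths in the first parent but not in the merge result must be deleted.
    for path in sorted(first_parent_tree.keys() - merged_tree.keys()):
        out.append(f"D {path}\n".encode())
    # Paths that are new or changed relative to the first parent are written
    # inline. Assumption: every file is a regular, non-executable file.
    for path in sorted(merged_tree):
        if first_parent_tree.get(path) != merged_tree[path]:
            out.append(f"M 100644 inline {path}\n".encode())
            out.append(data(merged_tree[path]))
    return b"".join(out)


trunk_tree = {"README": b"trunk readme\n", "old.c": b"obsolete\n"}
wanted_tree = {"README": b"trunk readme\n", "feature.c": b"new feature\n"}

stanza = merge_commit_stanza(
    "refs/heads/master",
    3,
    "A U Thor <author@example.com> 1300000000 +0000",
    "Merge branch 'feature'\n",
    ":1",
    [":2"],
    trunk_tree,
    wanted_tree,
)
```

Here README is unchanged from the first parent, so it never appears in the stanza; old.c gets a "D" line and feature.c gets an inline "M". Anything taken from the other parent has to show up as one of those explicit file commands.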