Tags: git, git-pull, git-flow

Will a "git pull develop" fetch all the commits reachable from develop?


I have a question about how git pulls changes from the remote, and how much history it brings along.

I'm considering following a git-flow workflow for my project. We are 80 developers, and we will integrate our changes from feature branches into the develop branch by means of pull requests, so that code review happens first.

We will need to (locally) rebase our feature branches on top of develop, so that we have all the latest develop changes integrated. Hence, we will be pulling develop often. Here, I don't want to fetch other teammates' feature branches, nor their commit history.

Now, if I pull develop, will this operation bring in commit history that happened on other feature branches if it is reachable (through a merge commit) from develop?

Thanks in advance :-)

EDIT: I might not have been clear enough:

  1. We use rebase locally, so that pull requests against the develop branch are mergeable. We don't use merge, as it might "pollute" feature branches when performing code review. If the pull request is accepted, we then merge it with a non-fast-forward commit.

  2. I know I can "git fetch origin develop". Here is my question: will git pull origin develop just "fetch" the blue commits or also the green ones? See figure git-pull-


Solution

  • I started on a complete answer, but it got way too long.

    To answer just a few specifics: your concerns are real but slightly misguided (not your fault, as much Git documentation is terrible). The crucial issue is not so much what git fetch fetches;¹ it's what is in the commit graph of the commits you merge with git merge, and which commits get copied when you choose to run git rebase, which depends, again, on the commit graph, and on the arguments you supply to git rebase.

    The key concept is reachability. Names like origin/master (which git fetch updates) make commits reachable, but commits (which git fetch brings in) also make other commits reachable. A reachable commit makes the entire chain of commits "before" that commit reachable. Merge commits, which list more than one parent commit ID, make two (or more) chains of commits reachable.


    ¹ Of course, what git fetch doesn't fetch can't possibly be reached (in your copy of the repo), since it does not exist (in your copy of the repo). I suspect that's what you are aiming for here, but it's difficult to achieve in general, and unnecessary anyway.
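    As a rough illustration of reachability through a merge commit, the sketch below builds a throwaway repository and counts the commits reachable from develop before and after a non-fast-forward merge. The repository layout, the branch name feature, and the demo identity are assumptions made up for this example.

```shell
#!/bin/sh
# Sketch: what becomes reachable from develop once a feature branch is merged.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)            # throwaway repository (hypothetical layout)
cd "$repo"
git init -q
git checkout -q -b develop   # name the still-unborn branch "develop"
git commit -q --allow-empty -m "base"

git checkout -q -b feature   # two commits that exist only on the feature branch
git commit -q --allow-empty -m "feature 1"
git commit -q --allow-empty -m "feature 2"

git checkout -q develop
before=$(git rev-list --count develop)

# The merge commit lists two parents, so it makes both chains reachable.
git merge -q --no-ff -m "merge feature" feature
after=$(git rev-list --count develop)

# One line: the merge commit's own ID followed by each parent ID.
words=$(git rev-list --parents -n 1 develop | wc -w)

echo "reachable from develop: before=$before after=$after"
echo "IDs on the merge line (commit + parents): $words"
```

    The feature commits count as part of develop's history as soon as the merge exists, which is why fetching develop has to bring them along.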


    Remember that (1) each commit is identified by its SHA-1 hash ID, (2) each commit contains the hash ID(s) of its parent commit(s), and (3) branch names are just names for one commit ID. The branch name gets a new ID stuffed into it frequently, to grow the branch (to add a regular or merge commit), or to point to commits copied by rebase.
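    These three points can be inspected directly. The following is a minimal sketch in a scratch repository (the branch name and commit messages are placeholders chosen for this example): it prints a raw commit object, which records its parent's hash, and shows that the branch name resolves to a single commit ID.

```shell
#!/bin/sh
# Sketch: a commit stores its parent's hash; a branch name stores one hash.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b develop
git commit -q --allow-empty -m "first"
first=$(git rev-parse develop)
git commit -q --allow-empty -m "second"
second=$(git rev-parse develop)

# Dump the raw commit object: tree, parent, author, committer, message.
git cat-file -p "$second"

# Pull out the "parent" header; it should be the hash of the first commit.
parent=$(git cat-file -p "$second" | sed -n 's/^parent //p')

# The branch name is just a reference that currently holds the newest hash.
tip=$(git rev-parse refs/heads/develop)
echo "develop -> $tip"
```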

    Then, remember that git rebase works by copying commits. The copies have new, different IDs:

              A--B--C       [original mybranch, before rebase]
             /
    ...--o--o
             \
              o--o           <-- origin/theirbranch
                  \
                   A'-B'-C'   <-- mybranch [after rebase]
    

    This is guaranteed to be fine as long as no one else has names (branch or tag names) or commits that point to any of the original commits A, B, or C. If they do have such names, those existing names may, or may not, continue to point back to the originals, not to the new copies. Even that is fine as long as you don't use them now. If and when the names are updated to point to new commits, the old ones become irrelevant as long as no still-reachable commits point to the old commits. If existing commits point to "outdated" commits, though, those commits will continue to point to them forever, since commits are permanent.²
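    To see the copying in action, here is a small sketch that rebases a one-commit branch and compares the hash before and after. The local develop branch stands in for origin/theirbranch from the diagram, and the file names are invented for the example.

```shell
#!/bin/sh
# Sketch: rebase produces a new commit (A') and leaves the original (A) behind.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b develop
echo base > base.txt; git add base.txt; git commit -q -m "base"

git checkout -q -b mybranch
echo a > a.txt; git add a.txt; git commit -q -m "A"
old_tip=$(git rev-parse mybranch)

git checkout -q develop          # someone else's work lands on the other branch
echo other > other.txt; git add other.txt; git commit -q -m "their work"

git checkout -q mybranch
git rebase -q develop            # copies A on top of "their work"
new_tip=$(git rev-parse mybranch)
new_parent=$(git rev-parse mybranch^)

echo "A  (original): $old_tip"
echo "A' (copy):     $new_tip"

# The original object is still in the repository; only the name moved.
git cat-file -e "$old_tip" && echo "original commit object still present"
```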


    ² No Git object can ever change. This is a fundamental guarantee that Git makes. However, all Git objects, including commits, that are completely unreachable are eventually removed. Git has a "garbage collector", git gc, that does this. It's a bit complicated as there are numerous grace period tricks to keep objects around: everything gets 14 days by default, and references (including branch, tag, and remote-tracking branch names) may have reflog entries, which make otherwise-unreachable commits reachable again. The reflog entries themselves persist for either 30 days or 90 days by default, depending on yet another reachability computation, comparing the current hash value in the reference to the hash in the reflog entry. The garbage collector is normally invoked automatically whenever Git thinks this might be a good idea.
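    The grace periods can be bypassed in a scratch repository to watch this happen. The sketch below is an experiment under assumed names (the scratch branch and the exists helper are hypothetical), not something to run on a repository you care about: it abandons a commit, shows that a reflog entry keeps it alive through git gc, then expires the reflog and collects again.

```shell
#!/bin/sh
# Sketch: a reflog entry protects an abandoned commit from git gc.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b develop
git commit -q --allow-empty -m "base"

git checkout -q -b scratch
git commit -q --allow-empty -m "work that will be abandoned"
orphan=$(git rev-parse HEAD)
git checkout -q develop
git branch -q -D scratch         # no branch names the commit any more

# Hypothetical helper: report whether an object is present in the repository.
exists() { git cat-file -e "$1" 2>/dev/null && echo yes || echo no; }

git gc -q --prune=now            # skip the default grace period for loose objects
with_reflog=$(exists "$orphan")  # HEAD's reflog still mentions the commit

git reflog expire --expire=now --all
git gc -q --prune=now
without_reflog=$(exists "$orphan")

echo "present while the reflog entry exists: $with_reflog"
echo "present after expiring the reflog:     $without_reflog"
```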


    On fetch

    For instance, suppose that your git fetch brings in, to your repository, origin/BobsBranch and it points to some commits:

              B1-B2-B3    <-- origin/BobsBranch
             /
    ...--o--o             <-- origin/develop
             \
              C1-C2-C3    <-- my_independent_work
    

    You can rebase your work whenever you like. Meanwhile Bob can rebase BobsBranch (though he may need to force-push the result to the server). Let's say he throws out those three commits entirely in favor of one new B4 commit. You run git fetch and pick up a new, different origin/BobsBranch; your repository now has:

              B4          <-- origin/BobsBranch
             /
            | B1-B2-B3    [a reflog entry for origin/BobsBranch]
            |/
    ...--o--o             <-- origin/develop
             \
              C1-C2-C3    <-- my_independent_work
    

    The reflog-only commits won't show up in git log --all or gitk --all views, and as long as you never use any of these B* commits, they do not harm you in any way (well, they do take up a bit of space in your repository).

    Harmless as they are, you can avoid bringing them over by telling git fetch which branch to fetch. When you run the git pull convenience command with a remote and a branch name (as in git pull origin develop), git pull runs git fetch with instructions to bring over only the commits reachable from that one branch, so that usually avoids bringing them over, unless, of course, they're reachable from something your Git does need, based on the one branch tip. (A plain git pull with no arguments fetches according to the remote's configured refspec, which by default covers every branch.)
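    The effect of naming a single branch can be checked with two scratch repositories. In this sketch, "server" plays the central repository and "mine" plays your clone; both paths, the branch merged_feature, and the have helper are assumptions for illustration. After fetching only develop, the merged feature commit is present locally while Bob's unmerged commit is not.

```shell
#!/bin/sh
# Sketch: "git fetch origin develop" brings what develop reaches, nothing more.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# "server": develop with one merged feature, plus Bob's unmerged branch.
git init -q "$work/server"
cd "$work/server"
git checkout -q -b develop
git commit -q --allow-empty -m "base"
git checkout -q -b merged_feature
git commit -q --allow-empty -m "feature work, later merged"
merged=$(git rev-parse HEAD)
git checkout -q develop
git merge -q --no-ff -m "merge merged_feature" merged_feature
git checkout -q -b BobsBranch
git commit -q --allow-empty -m "Bob's unmerged work"
bob=$(git rev-parse HEAD)

# "mine": an empty repository that fetches only develop.
git init -q "$work/mine"
cd "$work/mine"
git remote add origin "$work/server"
git fetch -q origin develop

# Hypothetical helper: is this object in my repository?
have() { git cat-file -e "$1" 2>/dev/null && echo yes || echo no; }
has_merged=$(have "$merged")
has_bob=$(have "$bob")

echo "merged feature commit fetched: $has_merged"
echo "Bob's unmerged commit fetched: $has_bob"
```

    In terms of the question's figure, commits on a feature branch come along exactly when a merge on develop reaches them.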

    On merge

    A "bad" case occurs when you merge in a commit that "reaches" a commit that is later copied by rebase. For instance, suppose you have this:

    ...--o--o--A--B   <-- origin/feature_X
             \
              C--D    <-- feature_Y
    

    Now you decide it is time to merge origin/feature_X's commits (A and B) into your feature_Y, so you make a merge commit:

    ...--o--o--A--B   <-- origin/feature_X
             \     \
              C--D--o   <-- feature_Y
    

    If someone else (upstream) decides to rebase and force-push their feature_X, so that your origin/feature_X points to new copies, you end up with this:

              o--A'-B'  <-- origin/feature_X
             /
    ...--o--o--A--B
             \     \
              C--D--o   <-- feature_Y
    

    That can happen even if there was no name attached to the rebase-copied commits, if you picked up something else by its name. For instance, if someone else pushed feature_F and promised it was done:

           A----B
          /      \
    ...--o--o--E--F   <-- origin/feature_F
             \
              C--D    <-- feature_Y
    

    and you then merge it, you get this:

           A----B
          /      \
    ...--o--o--E--F   <-- origin/feature_F
             \     \
              C--D--o   <-- feature_Y
    

    Now suppose they, or a third person, then rebase a branch they have that points to B, without realizing / remembering that commit F itself also points to B. That is, they start with this (note that they do not have your feature_Y):

           A----B     <-- myhacks
          /      \
    ...--o--o--E--F   <-- feature_F, origin/feature_F
    

    They then decide that it would be better to rebase myhacks onto commit E, so they run:

    $ git checkout myhacks
    $ git rebase 123e4567    # <id-of-E>
    

    which produces:

           A----B
          /      \
    ...--o--o--E--F      <-- feature_F, origin/feature_F
                \
                 A'-B'   <-- myhacks
    

    Eventually, when you fetch (perhaps via git pull) and get their final version of myhacks—whether or not it has a name at that time, as long as it has commits A' and B'—you will have (and retain) the original A--B commits, through commit F, and add the A'-B' chain, even though you may never have seen their branch-name myhacks.
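    A reduced version of this scenario can be reproduced locally. The sketch below uses a single commit A instead of A--B, invented file names, and a hypothetical pid helper built on git patch-id; it shows that after the rebase both the original and the copy exist, each reachable from a different name, and that they carry the same change.

```shell
#!/bin/sh
# Sketch: a merged commit (A) and its rebased copy (A') coexist.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b feature_F
echo base > base.txt; git add base.txt; git commit -q -m "o"

git checkout -q -b myhacks
echo a > a.txt; git add a.txt; git commit -q -m "A"
A=$(git rev-parse HEAD)

git checkout -q feature_F
echo e > e.txt; git add e.txt; git commit -q -m "E"
E=$(git rev-parse HEAD)
git merge -q --no-ff -m "F" myhacks   # F has two parents: E and A

git checkout -q myhacks
git rebase -q "$E"                    # copies A onto E, ignoring that F reaches A
A2=$(git rev-parse HEAD)

# Hypothetical helper: a hash of the commit's diff, independent of its parent.
pid() { git show "$1" | git patch-id | cut -d' ' -f1; }

echo "A : $A  patch-id $(pid "$A")"
echo "A': $A2  patch-id $(pid "$A2")"
git log --oneline --graph --all
```

    The graph printed at the end shows A hanging off the merge F and A' on top of E, matching the last diagram above.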

    Conclusion

    The "bad" case we saw above happened when git fetch brought in commit F, via the name (in the repository you're fetching from, presumably one stored on a central server) feature_F. (You and your Git renamed this origin/feature_F.) The problem was not feature_F (or origin/feature_F) itself, though, but rather myhacks: a name neither you, nor the central server, ever saw! The person who did have that name—or maybe even made it up after the fact—used it to copy commits A and B, without thinking about who had the originals. He then pushed the copies, maybe under yet another name.

    The names matter at fetch and push time because git fetch and git push transfer commits by refspecs (mostly just pairs of reference names, plus some ancillary stuff). Before and after that point, though, the names are mainly distractions: it's the set of commits, as named by their IDs, and their reachability status, that matters.
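    For the workflow in the question, one way to make a plain git fetch limit itself to develop is to narrow the remote's fetch refspec. This is a sketch under assumptions: the URL is a placeholder that is never contacted, and whether narrowing is worth it for your team is a judgment call, since commits merged into develop arrive either way.

```shell
#!/bin/sh
# Sketch: inspect and narrow the refspec that git fetch uses for "origin".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git remote add origin https://example.com/project.git   # hypothetical URL

# Default: every branch on the remote maps to a remote-tracking name.
default_refspec=$(git config --get-all remote.origin.fetch)

# Track only develop from now on.
git remote set-branches origin develop
narrow_refspec=$(git config --get-all remote.origin.fetch)

echo "default: $default_refspec"
echo "narrow:  $narrow_refspec"
```

    The same single-branch mapping can be given once on the command line, as in git fetch origin develop, without changing the configuration.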