distributed-transactions2phase-commit

How does three-phase commit avoid blocking?


I am trying to understand how three-phase commit avoids blocking

Consider the following two failure scenarios:

Scenario 1: In phase 2 the coordinator sends preCommit messages to all cohorts and has gotten an ack from all except cohort A. Network problems prevent cohort A from receiving the coordinator's preCommit message. Cohort A times out waiting for the preCommit message and chooses to abort. Then both the coordinator and cohort A crash.

Scenario 2: The protocol reaches phase 3. The coordinator sends a doCommit message to cohort A. But before it can send more doCommit messages the coordinator crashes. Cohort A commits its part of the transaction then crashes.

As far as I can tell the remaining cohorts have the exact same state at the end of scenario 1 and scenario 2. So when a recovery coordinator steps in how can it find out from the remaining cohorts whether we are in scenario 1 and abort or we are in scenario 2 and commit and thus avoid blocking?


Solution

  • Three-phase commit isn't magic; it's just more resilient than two-phase commit. In particular, 3PC is resilient against single-point failure, but not all kinds of multiple-point failure. Both scenarios in the question posit two-point failures. In other words, the premise of the question is misguided; it's asking more of 3PC than it's capable of.

    For further reading, here's a sentence from the abstract of a paper on the subject, Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols, by Muhammad Atif, to whet your appetite:

    We also apply our method to its “amended” variant, the Three-Phase Commit Protocol (3PC) and prove it to be erroneous for simultaneous site failures

    I found this paper to provide a foothold into the literature. There's no small amount of it on this subject, if you want to delve in.