I noticed that the default configuration for zookeeper.forceSync is "no".
This means FileChannel.force() will not be called on the write-ahead log.
I was under the impression that, for the Zab consensus algorithm used by ZooKeeper to work correctly, every entry had to be persisted to disk before responding to the leader. Otherwise some data might be lost during leader election.
Is it safe for the default to be "no"?
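For reference, this is roughly the call I understand forceSync to control, as a minimal sketch of appending a transaction to a log and forcing it to media (illustrative only, not ZooKeeper's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WalSketch {
    public static void main(String[] args) throws Exception {
        // Read the documented system property; anything other than "no" forces a sync.
        boolean forceSync = !"no".equalsIgnoreCase(System.getProperty("zookeeper.forceSync", "yes"));

        try (FileChannel wal = FileChannel.open(Path.of("log.1"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            wal.write(ByteBuffer.wrap("txn-entry".getBytes()));
            if (forceSync) {
                // Blocks until the OS has pushed the data out to the storage device.
                wal.force(false);
            }
            // With forceSync=no the entry may still be sitting in the OS page cache,
            // so a power loss or host crash can silently discard it.
        }
    }
}
```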
No, it is not safe to run with forceSync=no.
As discussed on the zookeeper user list in 2014:
There's a big warning in the documentation that says that's a possibility.
If you don't force both Java and the OS to flush their IO buffers to disk,
then you have no guarantees that your data is consistent.
Looking at the ZooKeeper admin docs in 2020, they only say:
forceSync
(Java system property: zookeeper.forceSync)
Requires updates to be synced to media of the transaction log before
finishing processing the update. If this option is set to no,
ZooKeeper will not require updates to be synced to the media.
You have to look above that at the section title and first paragraph of the section to be made aware of the danger:
Unsafe Options
The following options can be useful, but be careful when you use them.
The risk of each is explained along with the explanation of what the
variable does.
To me, that isn't a "big warning in the documentation", other than describing all settings in that section as "unsafe". There is no information about the risks of using the option. For example, I would expect solid documentation to mention the risk of correlated failures discussed below. You might consider raising a ticket that the documentation of that unsafe option isn't as clear and helpful as it should be.
The discussion in 2014 seems to indicate that, at the time, forceSync was enabled by default. They talk about a bug with forceSync=no where people were not seeing the expected performance improvements. That suggests that the default was safe and that you had to "opt out of safety" by setting forceSync=no to increase performance. If the default is now forceSync=no, I suggest that you file a bug.
The discussion on the forum goes on to suggest that for a single-node failure there should be no issue. This will be the case because, under ZAB, writes in an ensemble are not acknowledged until the data is on at least two nodes. A crash of a single node could therefore only lose data that was neither flushed to disk nor replicated to another node, i.e. data that had not yet been acknowledged to the client.
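As a toy illustration of that quorum rule (a sketch of the idea, not ZooKeeper's implementation; the class and method names are made up): the leader only treats a proposal as committed, and only then acknowledges the client, once a majority of the ensemble has logged it.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of ZAB-style quorum acknowledgement, for illustration only.
public class QuorumTracker {
    private final int ensembleSize;
    private final Set<String> loggedOn = new HashSet<>();

    public QuorumTracker(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // A node reports that it has written the proposal to its transaction log
    // (and, with forceSync=yes, flushed it to disk).
    public void ackFrom(String nodeId) {
        loggedOn.add(nodeId);
    }

    // The client is only acknowledged once a strict majority holds the entry.
    public boolean committed() {
        return loggedOn.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        QuorumTracker txn = new QuorumTracker(3);
        txn.ackFrom("leader");
        System.out.println(txn.committed()); // false: 1 of 3 is not a majority
        txn.ackFrom("follower-1");
        System.out.println(txn.committed()); // true: 2 of 3 is a majority
    }
}
```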
The big problem when data isn't flushed to disk is when you have a correlated failure of nodes. If a correlated failure takes out the nodes holding a write before it has been flushed to disk on at least one of them, a write that was acknowledged to the client can be lost. ZooKeeper is often used for low-volume metadata such as leader election results. The whole selling point of ZooKeeper is that it is supposed to be safe to hold such critical data. Forgetting who won a leadership election could cause huge damage to a system that relies upon ZooKeeper for safety but doesn't actually write its data into ZooKeeper. In the low-throughput write use cases where ZooKeeper is typically deployed, people often can, and so should, run safely with forceSync=yes.
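For example, one way to make the safe setting explicit is to pass the documented Java system property to the server JVM. The snippet below assumes the stock zkServer.sh launcher, which picks up SERVER_JVMFLAGS from conf/zookeeper-env.sh:

```sh
# conf/zookeeper-env.sh (sourced by zkEnv.sh/zkServer.sh if present)
# Require every transaction log update to be synced to media before it is acknowledged.
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.forceSync=yes"
```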
As the 2014 thread highlights, there are approaches that can be taken to try to make not forcing to disk safe. Battery-backed RAID controllers are mentioned, which I recall having on physical servers at the turn of the century. We also had teamed power supplies in every server connected to two external power grids, redundant network cards, redundant generators to restore power, etc. You might then put the servers running the ensemble into different racks so all network traffic can go via multiple top-of-rack switches. That would reduce the risk of correlated failures to a negligible level. At that point you can sleep at night and only worry about accidentally running a deployment that kills multiple processes at the same time. Perhaps that is what the documentation means about "being careful" when using that option.
The problem is that at the last three global companies I have worked for that deployed ZooKeeper, it ran on stock VMs without any guarantees about rack location, anti-affinity, or correlated failures. There was no guarantee that if I set up five VMs to host a ZooKeeper ensemble they wouldn't all be allocated to the same physical host. Also, during any future hardware upgrade the infra teams reserved the right to move VMs around physical hosts without consulting us. When VMs are on the same host you have very little protection against correlated failures. I simply wouldn't sleep at night with forceSync=no without being sure that bulletproof measures were in place to ensure that correlated failures couldn't happen.