Current setup: Cassandra 2.2.5, with the default gossip interval of 1 second and a phi threshold of 8. The problem I am facing is spikes in hints. One of the reasons hints go up is that a node gets marked down (gossip has not heard from it for long enough to exceed the phi threshold).
I read an article that says a phi threshold of 8 corresponds to about 18 seconds, give or take a few seconds. Now I need to understand the reason: what is blocking gossip from communicating for 18 seconds? What is the checklist that needs to be satisfied for gossip to communicate?
Re: "How does cassandra gossip protocol and phi_threshold works?": Phi is approximated as phi = (tnow - tLast) / mean, and a node is marked down when phi > phi_threshold / 0.434. For your settings, and assuming a mean of 1 (that is, the node usually receives heartbeats 1 second apart), a node will be marked down if we haven't received any heartbeats from it for 8 / 0.434 ≈ 18.42 seconds.
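To make the arithmetic concrete, here is a minimal Python sketch of the approximation described above. It is not Cassandra's actual code: the function names (`phi`, `is_convicted`, `seconds_until_conviction`) are made up for illustration, and it assumes the constant 0.434 is 1 / ln(10) and that the mean heartbeat interval is known and constant.

```python
import math

# Assumption: the 0.434 factor in the formula above is 1 / ln(10).
PHI_FACTOR = 1.0 / math.log(10.0)


def phi(t_now, t_last, mean_interval):
    """Approximate phi: time since the last heartbeat, in units of the mean interval."""
    return (t_now - t_last) / mean_interval


def is_convicted(t_now, t_last, mean_interval, phi_threshold=8.0):
    """True when the node would be marked down under the approximation above."""
    return PHI_FACTOR * phi(t_now, t_last, mean_interval) > phi_threshold


def seconds_until_conviction(mean_interval, phi_threshold=8.0):
    """Silence (in seconds) needed before a node is marked down."""
    return phi_threshold / PHI_FACTOR * mean_interval


if __name__ == "__main__":
    # Threshold 8, heartbeats normally 1 second apart.
    print(round(seconds_until_conviction(1.0), 2))  # 18.42
    print(is_convicted(t_now=18.0, t_last=0.0, mean_interval=1.0))  # False
    print(is_convicted(t_now=19.0, t_last=0.0, mean_interval=1.0))  # True
```

Note that the result scales with the mean: if heartbeats normally arrive 2 seconds apart, the same threshold of 8 tolerates roughly 37 seconds of silence under this approximation.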
The paper documenting the algorithm can be found here.
Re: "What is the checklist that needs to be satisfied for gossip to communicate?": to me there are a few things: