I wonder is there is a recommendation policies or rules on how storages that implement AP with eventual consistency in CAP theorem should behave when network partition occurs and restores. AP means DBMS allows clients to do writes while replica node is disconnected from master. However when the network is back up and running, what happens to replica server's data that got written during the network disruption between the replica and leader/master?
As an example:
Person
address happened to be updated in both leader and disconnected from the leader replica by conflicting data. Will replica's data to be replaced?In the case of MySQL, data change events are written to a log on the source instance. The replica subscribes to this log remotely, so it downloads new events very promptly after they are committed to the log. Once it downloads that portion of the log, it can replay the events against its own instance of the data.
As long as the source instance doesn't purge the logs too quickly, the network can be interrupted, or the replica can be rebooted or whatever, and it will just pick up where it left off when it reconnects. Of course the replica must keep track of its progress in both downloading and replaying those events.
I have been supporting replica database instances that were offline for too long. For example if a piece of hardware in the replica's server needs to be replaced and it takes a few days to get it shipped, then the logs on the source are long gone by the time the replica finally tries to download them.
The replica has integrity only if it replays all the events, in the correct order. If it can't do that because the logs have expired, then the replica cannot recover. In that case, we reinitialize the replica by dropping all its data, get a new snapshot of the whole dataset from the source, and start again from that point. We had to do this several times a week, because we were running thousands of replicas.
Re your comments:
Do you do reinitializing the replica manually or the entire process (drop data, restoring snapshot etc) could be done automatically?
It's difficult to make one solution that will work in everyone's environment.
At my last job, we developed our own automation to do this, so we could run one command to launch a background request to rebuild. It's the same steps that one would do manually, but we implemented them in code that would run in an automation service. That way we don't forget a step, we don't rely on anyone's laptop or interactive session, the automation has good quality assurance, it can do retries, can run multiple jobs in parallel, can report progress. But the solution was designed very custom for the environment we worked in. We had to comply with server naming conventions, network security policies, monitoring and alerting systems, etc.
MySQL Clone plugin is a recent new technology that claims to do a similar thing, but I haven't had an opportunity to try it.
MySql does not allow to write to replica in any configuration, does it?
By default, MySQL does allow changes directly on a replica. Of course this can compromise integrity of the replica if you aren't careful. So in many environments, it's a good practice to set configuration options read_only=ON
and super_read_only=ON
on the replica to prevent it.
what does happen to the replica's data, updated and inserted while disconnected from the source, when connection is restored and replica is able to sync up to the source? I wonder if all the writes to replica are going to be lost
MySQL replication doesn't do a full re-sync. It doesn't have any way of assuring that the replica is in sync with its source. Replication only replays incremental change events. This works if the two instances started from the same state, and all the changes are deterministic, so then they should stay in sync, in theory.
In practice, there's a nonzero risk of "data drift" — i.e. the replica's data diverges from the data on its source. MySQL's replication solution cannot detect data drift. There are third-party tools like pt-table-checksum that try to do this, but they aren't 100% perfect at detecting differences.
To minimize the risk: