I've been experimenting with Aeron Cluster, and one thing that is unclear to me is how to deal with applications where nodes hold tens of gigabytes of state. This state lives in memory and is built up by replaying the events.
However, if I initiate a snapshot (which can only be triggered on the leader), it will obviously block, since you can't keep applying events and take a snapshot at the same time. For latency-critical apps you can't wait for seconds while the snapshot is taken.
One solution that comes to mind is to have a follower take the snapshot and, when it's done, catch up with the leader and then take over. Once the snapshot is taken and the log is in the right state, you know your snapshot is valid. This way you have seconds to take your snapshot.
Or the leader, when it needs to take a snapshot, hands leadership over to the most up-to-date follower, takes the snapshot, and then, if needed, takes leadership back, so your clients are never blocked.
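To make the first idea concrete, here is a toy model in plain Java of a follower that snapshots at a known log position while the leader keeps appending, and of recovery from that snapshot plus the log tail. It is only a sketch under my own assumptions: it does not use the Aeron API, and `Event`, `Node`, `Snapshot`, `takeSnapshot` and `recover` are hypothetical names made up for illustration. It says nothing about how Aeron Cluster actually coordinates snapshots.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FollowerSnapshotSketch {
    /** Toy log entry (hypothetical): add a delta to a keyed counter. */
    static final class Event {
        final String key;
        final long delta;
        Event(String key, long delta) { this.key = key; this.delta = delta; }
    }

    /** Toy node (hypothetical): state is derived purely by applying log entries in order. */
    static final class Node {
        final Map<String, Long> state = new HashMap<>();
        long appliedPosition = 0; // number of log entries applied so far

        void applyUpTo(List<Event> log, long position) {
            while (appliedPosition < position) {
                Event e = log.get((int) appliedPosition);
                state.merge(e.key, e.delta, Long::sum);
                appliedPosition++;
            }
        }
    }

    /** A snapshot is only meaningful together with the log position it reflects. */
    static final class Snapshot {
        final long position;
        final Map<String, Long> state;
        Snapshot(long position, Map<String, Long> state) {
            this.position = position;
            this.state = state;
        }
    }

    /** Copies the follower's state; the follower must not apply events while this runs. */
    static Snapshot takeSnapshot(Node node) {
        return new Snapshot(node.appliedPosition, new HashMap<>(node.state));
    }

    /** Rebuilds a node from a snapshot and replays the log entries after its position. */
    static Node recover(Snapshot snapshot, List<Event> log) {
        Node node = new Node();
        node.state.putAll(snapshot.state);
        node.appliedPosition = snapshot.position;
        node.applyUpTo(log, log.size());
        return node;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        Node leader = new Node();
        Node follower = new Node();

        // Both nodes have applied the first five entries.
        for (int i = 0; i < 5; i++) log.add(new Event("a", 1));
        leader.applyUpTo(log, log.size());
        follower.applyUpTo(log, log.size());

        // The follower stops applying and snapshots at its current position.
        Snapshot snapshot = takeSnapshot(follower);

        // Meanwhile the leader keeps serving clients.
        for (int i = 0; i < 3; i++) log.add(new Event("b", 10));
        leader.applyUpTo(log, log.size());

        // The follower catches up after the snapshot is done.
        follower.applyUpTo(log, log.size());

        // Snapshot plus log tail reproduces the leader's state.
        Node recovered = recover(snapshot, log);
        System.out.println("snapshot position = " + snapshot.position);
        System.out.println("recovered matches leader: " + recovered.state.equals(leader.state));
    }
}
```

The point of the sketch is that the snapshot is valid as long as it is tagged with the exact log position it reflects, which is why in this model the node that takes it can lag behind without stalling the leader.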
Am I doing something wrong, or am I misunderstanding how snapshots work?
There is not much information on this amazing library; at least, I couldn't find an answer to this.
There is an open issue on this feature: https://github.com/real-logic/aeron/issues/1263