Tags: aerospike, aerospike-ce

Data loss detection in Aerospike


If we have a 6-node cluster with replication factor 2 and paxos-single-replica-limit 3 (once we are down to 3 nodes, the replication factor effectively becomes 1), and all of a sudden 3 nodes die because of a cascading failure, it might happen that a few partitions were not able to migrate in time. But as per this doc, the cluster will continue as if nothing had happened. In strong consistency mode the affected partitions may become dead partitions, and we would have to manually revive them.
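
For context, the relevant parts of such an aerospike.conf would look roughly like this (the memory and storage lines are placeholders, and option names can vary by server version):

    service {
        paxos-single-replica-limit 3    # at 3 or fewer nodes, drop to a single replica
    }

    namespace test {
        replication-factor 2            # two copies of each partition
        memory-size 4G                  # placeholder
        storage-engine memory           # placeholder
    }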

How can I know when there has been data loss, so that I can restore from a previous snapshot?

If it matters, we are on the Community Edition.


Solution

  • In strong consistency mode (which requires an Enterprise license), there will not be any data loss. If the majority of the cluster literally dies, the dead partitions will need to be manually revived, roughly as sketched below.
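
    A rough sketch of the manual revive, assuming a strong-consistency namespace named test (the namespace name is illustrative, and the exact invocation can vary by tools version):

    # revive the dead partitions of the SC namespace on the affected nodes,
    # then trigger a recluster so the revived partitions rejoin
    asinfo -v "revive:namespace=test"
    asinfo -v "recluster:"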

    In the absence of strong consistency mode (the default), one can grep for "rebalanced: expected-migrations" in the Aerospike logs of all live nodes. The result would look somewhat like the following:

    Jun 27 2022 19:11:22 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (0,0,0) fresh-partitions 0
    Jun 27 2022 19:18:13 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (2325,1718,1978) fresh-partitions 0
    Jun 27 2022 19:18:13 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (2325,1718,1978) fresh-partitions 0
    Jun 27 2022 19:35:29 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (514,50,50) fresh-partitions 0
    Jun 27 2022 19:35:29 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (0,0,0) fresh-partitions 0
    Jun 27 2022 19:58:18 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (1941,1711,1293) fresh-partitions 0
    Jun 27 2022 19:58:18 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (1941,1711,1293) fresh-partitions 0
    Jun 27 2022 20:12:54 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (1369,1089,1393) fresh-partitions 170
    Jun 27 2022 20:12:54 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (833,307,1245) fresh-partitions 0
    Jun 27 2022 20:19:07 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (1467,1172,1576) fresh-partitions 190
    Jun 27 2022 20:19:07 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (385,418,770) fresh-partitions 0
    Jun 27 2022 20:19:59 GMT: INFO (partition): (partition_balance.c:928) {test2} rebalanced: expected-migrations (1830,1477,1926) fresh-partitions 128
    Jun 27 2022 20:19:59 GMT: INFO (partition): (partition_balance.c:928) {test} rebalanced: expected-migrations (581,614,1162) fresh-partitions 0
    

    Look at fresh-partitions here. If it is greater than 0, that many partitions were unavailable and Aerospike created fresh (empty) partitions for you. If the missing nodes have actually died, that means there has been data loss. If the other nodes come back (because they did not die but were only network-partitioned), the older data will not be lost; instead, conflict resolution takes place between the older partition and the freshly created one. The default conflict-resolution policy is generation, which means the version of a key that has been modified more often wins. A quick way to scan the logs for non-zero fresh-partitions is sketched below.
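
    As a quick sketch, assuming the default log location /var/log/aerospike/aerospike.log (adjust the path, or use journalctl, to match how your nodes log), the following keeps only the rebalance lines where fresh-partitions is non-zero; run it on every live node:

    # the log lines above end with "fresh-partitions <N>", so filtering out
    # the "fresh-partitions 0" lines leaves only the suspicious rebalances
    grep "rebalanced: expected-migrations" /var/log/aerospike/aerospike.log \
        | grep -v "fresh-partitions 0$"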

    source