My MongoDB sharded cluster has 3 shards, each running as a 3-member replica set. To summarize:
Config Server:
shardcfg1.server.com:27018
shardcfg2.server.com:27018
shardcfg3.server.com:27018
Shard1:
shard11.server.com:27000 (P)
shard12.server.com:27000 (S)
shard13.server.com:27000 (S)
Shard2:
shard21.server.com:27000 (S)
shard22.server.com:27000 (STARTUP)
shard23.server.com:27000 (Unhealthy - invalidReplicaSetConfig: Our replica set configuration is invalid or does not include us)
Shard3:
shard31.server.com:27000 (S)
shard32.server.com:27000 (P)
shard33.server.com:27000 (S)
As the state above shows, the problem lies in SHARD2: it has no primary, shard22.server.com is stuck in STARTUP, and shard23.server.com reports itself as not a member of the replica set.
The secondary shard21.server.com can be used to take a dump, so potentially there is no data loss. However, I have no idea how to stabilize the cluster again.
How would I remove SHARD2 completely from the cluster? Or how should I reinitialize the shard with the same servers?
One small detail that I missed, and which turned out to be the key to the solution: the cluster was managed by Mongo-MMS!
Solution:
So I had one secondary, another server in STARTUP mode, and a third that ridiculously declared itself not part of the replica set! The entire cluster is managed by MMS. I shut down all three servers, then started the one available secondary in standalone mode to take a backup of the entire database.
During this period I removed the shard from my cluster, but the draining got stuck because the shard had no primary. One odd thing did happen, though: the automation agent on these servers was removed. Once the backup was complete, I restarted the mongod on the server that had been the secondary and had the data on it.
The terminal sadly still showed SECONDARY. However, when I checked rs.status() it listed three servers, although I remembered splicing off one of the rogue servers. That's when it struck me that MMS had been managing the config of this replica set.
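For anyone repeating the rs.status() check described above, here is a minimal sketch of a helper that condenses its output. It assumes the usual `members` array with `name`, `stateStr` and `health` fields; `summarizeReplicaSet` is a hypothetical name of mine, not something MongoDB or MMS provides.

```javascript
// Hypothetical helper (not part of MongoDB): condenses an rs.status()
// document into the facts that matter here.
function summarizeReplicaSet(status) {
  // rs.status() lists every member the current config knows about.
  const members = status.members || [];
  return {
    // True only if some member currently reports PRIMARY.
    hasPrimary: members.some((m) => m.stateStr === "PRIMARY"),
    // Names of members whose health flag is not 1.
    unhealthy: members.filter((m) => m.health !== 1).map((m) => m.name),
    // One "host:port STATE" string per member, in config order.
    states: members.map((m) => m.name + " " + m.stateStr),
  };
}

// In the mongo shell, connected directly to a shard member:
//   printjson(summarizeReplicaSet(rs.status()));
```

If `states` lists more members than you expect, as it did for me, the replica set config on that node is not the one you think it is.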
I quickly removed the rogue server from the config and reconfigured with the force flag set to true. That left two servers, one in SECONDARY and the other in STARTUP. A few seconds after the reconfiguration, voila! The secondary promoted itself to primary.
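The forced reconfiguration can be sketched roughly as follows. This is an illustration under assumptions rather than the exact commands I typed: `withoutMember` is a hypothetical helper, and the host string is the rogue member from the listing in the question.

```javascript
// Hypothetical helper (not part of MongoDB): returns a copy of a replica
// set config with the given host removed. The input config is not modified
// and the remaining members keep their existing _id values.
function withoutMember(cfg, host) {
  const next = Object.assign({}, cfg);
  next.members = cfg.members.filter((m) => m.host !== host);
  return next;
}

// In the mongo shell, connected directly to the surviving secondary:
//   var cfg = withoutMember(rs.conf(), "shard23.server.com:27000");
//   rs.reconfig(cfg, { force: true }); // force: there is no primary to accept a normal reconfig
//   rs.status();                       // re-check member states afterwards
```

A forced reconfig is meant for exactly this kind of situation, where no primary is available. If MMS automation is still managing the replica set, expect it to try to push its own config back.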
A long fight, but I'm glad to say I never needed to restore the backup or rework the entire shard!