I am designing the schema for a geographically distributed application that would be spread across different countries and cities. There are related collections such as -
Shops (spread across countries and cities) 15 days transactions for the shops (rest goes to historical store) etc.
Is it possible to ensure that shops and the transactions of the shops are co-located in the same shard? Currently in the transactions collection, say I am storing just the unique _id of shop as reference.
Suppose I shard the Shop collection with a key such as {region, country, city, shop_id}. Do I have to store the same columns/attributes for the transactions table - i.e. region, country, city, shop_id instead of just the shop_id and then choose a shard key like - {region, country, city, shop_id, tx_id} to ensure that it is placed in the same shard as the Shop collection?
In other words if a 'child' collection has records logically related to records of a 'parent' collection, then should the entire shard key that we apply to the 'parent' collection have to be a part of the shard key of the 'child' to ensure that they are co-located on the same shard?
Thanks and Regards, Archanaa Panda
Here is how I would do it. While there are valid use cases to have one or more shards per country, usually you want to do a geographical distribution based on regions. The usual ones – EMEA, APAC and NCSA – should suffice for this and I will use this in my example. As a side note: You might want to split EMEA into EME (data center in Europe) and A (I have good experiences with data centers in Sout Africa), since the connection from and to Africa sometimes is... not optimal.
As for the shards, I'll assume three shards named shard0
(EMEA) , shard1
(APAC) and shard3
(NCSA).
As you found out, it is not too important to have the data for each region on a single shard (which would not be very scalable), but on shards with the same tag. I strongly suggest to have all shards with the same tag and their front ends hosted in the same data center: Internal traffic tends to be free of charge and the internal bandwidth usually is much higher (or at least can be upgraded to be higher) than external bandwidth.
Set up your shards. Since speed seems to be an issue for you: use SSDs. They are so much faster that they will easily save you multiple shards, as spinning disks pretty usually become the limiting factor.
Tag the shards:
sh.addShardTag("shard0", "EMEA")
sh.addShardTag("shard1", "APAC")
sh.addShardTag("shard2", "NCSA")
Add the tag ranges. In case you have the region for each of your shops, there is no need to do it by country. Simply tag by region:
sh.addTagRange("commerce.shops",{"region":"EMEA"},{"region":"EMEA"},"EMEA")
sh.addTagRange("commerce.shops",{"region":"APAC"},{"region":"APAC"},"APAC")
sh.addTagRange("commerce.shops",{"region":"NCSA"}, {"region":"NCSA"},"NCSA")
The reason why I wouldn't assign country codes is that you would have to assign each country code to a tag. And since you already have the region, why not use it to make your life easier?
As a shard key, I'd use a compound key of region and whatever deems you appropriate. Note that you should not use ObjectId as the other component of the shard key, as it is monotonically increasing, which will lead to problems if you have more than one shard per region. Lets say your ShopId is the other part of the shard key and it is an ObjectId, there is a workaround: Using a hashed shard key.
sh.shardCollection("commerce.shops",{"region":1,"shopId":"hashed"})
This way, all documents will be distributed to the shards responsible for the respective region while still allowing chunks to be distributed amongst them.
hth.