I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).
{
"content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by @psawers",
"author": {
"username": "TheNextWeb",
"id": "10876852",
"name": "The Next Web",
"photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
"url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
"description": "",
"serviceName": "twitter"
},
"attachments": [
{
"title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
"description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
"url": "http:\/\/t.co\/tbsSrVYneK",
"type": "link",
"photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
}
]
}
Posts are read often (we have a view with 4 tabs, each tab requires 24 posts to be shown). Currently we are indexing these lists in Redis, so querying 4x24posts is as simple as fetching the lists from Redis (returns a list of mongo ids) and querying posts with the ids.
Updates on the embedded author happen rarely (for example when the author changes his picture). The updates do not have to be instantaneous or even fast.
We're wondering if we should split up the author and the post into two different collections. So a post would have a reference to its author, instead of an embedded / duplicated author. Is a normalized data state preferred here (author is duplicated for every post, resulting in a lot of duplicated data / extra bytes)? Or should we continue with the de-normalized state?
As it seems that you have a few magnitudes more reads than writes, it probably makes little sense to split this data out into two collections. Especially with few updates, and you needing almost all author information while showing posts one query is going to be faster than two. You also get data locality so potentially you would need less data in memory as well, which should provide another benefit.
However, you can only really find out by benchmarking this with the amount of data that you'd be using in production.