mongodb database-design nosql database-versioning

Implement draft and publication system in MongoDB

I have a mongodb database with collection documents that are approximately as follows:

// book document
{
    _id: $oid,
    userId: "..."
    name: "name",
    description: "description"
    status: "draft"
    // ...
}

// page document
{
    _id: $oid,
    bookId: "..."
    name: "name",
    description: "description"
    // ...
}

A book can have thousands of pages. Each book has a status property which can have one of the statuses "draft", "published" or "archived".

A "draft" book can be created from any other book, e.g. with "draft", "published" or "archived" status.

Every time a new draft is created, a new book document with all the same pages as the original book are created.

Any "draft" book and all its pages can be edited independently of the original book and its pages. This means that any editing on the new draft book or any of its pages does not affect the original book or its pages.

In the design, priority should be given to book query performance over draft creation performance, i.e. querying a book with all its (paginated) pages is the top performance priority.

What is the best approach to achieve the requirements above? Any modification to the database schema is possible at this stage.

These are options that I considered:

Clone the entire book with all its pages any time a new draft is created off an existing book. This would provide clean drafts that can be independently edited in their entirety and queried in the fastest way possible with a simple find. The downside is that the size of the page collection would grow rapidly with the number of drafts and most of the pages in cloned books would be the exact same version as in the original book. Also, draft creation time would grow proportionally with the number of pages in a book.
Add a parentId property to the book document to refer to the original book and reuse all un-modified pages from the parent. Any time a page is edited, a new page is cloned with bookId set as the id of the new draft. This approach would reuse most of the page documents but querying a draft book would be much more complex as we would need to follow the parentId chain to find all modified pages and all pages belonging to all the parent books (drafts can be created multiple times off different drafts). Draft creation would be very fast (just cloning one book) but query times would grow proportionally with the number of parents of a draft. Also, when querying a draft where pages have been edited in different parents, we would need to filter out all old page versions and keep only the newest version in the chain.

Solution

Behaviour of the 2 approaches you described are quite different. In the first (full clone) case, changes in original pages apply to original book only. In the second (hierarchical) case, changes in original pages are automatically propagated to all derivative drafts. I wouldn't question which one is more performant, as it is like comparing apples to oranges.

Consider a 3rd option (copy-on-change) - store references to the pages in a book:

// book document
{
    _id: $oid,
    userId: "..."
    name: "name",
    description: "description"
    status: "draft"
    // ...
    pages: [
        $oid1,
        $oid2,
        $oid3,
        ....
    ]
}

Creating a new draft is as easy as cloning a single book document. Querying a book with pages is a trivial lookup aggregation.

The most expensive operation will be changes in the pages and may require a 2 phase commit if you need strong data integrity.

With this approach you can choose either to propagate changes to derivative drafts, or keep it local. If later, you may need an additional housekeeping step to remove orphaned pages.