[mongodb] [azure] [azure-cosmosdb] [azure-cosmosdb-mongoapi] [azure-cosmosdb-changefeed]

How reliable is change stream support in Azure Cosmos DB’s API for MongoDB?


Description

I am working on an ASP.NET Core 3.1 web application that needs to track and respond to changes made to a MongoDB database hosted by Azure Cosmos DB (MongoDB API version 3.6). For this purpose I am using the change feed support (change streams).

The changes are pretty frequent: ~10 updates per second on a single entry in a collection.

To track changes made to the collection, I am dumping the affected entries to a file (just for testing purposes) with the following piece of code.

// Requires: using MongoDB.Bson.IO; using MongoDB.Driver; (plus System.IO and System.Threading)
private async Task HandleChangeStreamAsync<T>(IMongoCollection<T> coll, StreamWriter file, CancellationToken cancellationToken = default)
{
    // Watch only inserts, updates and replaces, and project just the fields
    // needed downstream (_id, fullDocument, ns, documentKey).
    var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<T>>()
            .Match(change => change.OperationType == ChangeStreamOperationType.Insert ||
                             change.OperationType == ChangeStreamOperationType.Update ||
                             change.OperationType == ChangeStreamOperationType.Replace)
            .AppendStage<ChangeStreamDocument<T>, ChangeStreamDocument<T>, ChangeStreamOutputWrapper<T>>(
                  "{ $project: { '_id': 1, 'fullDocument': 1, 'ns': 1, 'documentKey': 1 }}");

    // Ask the server to deliver the current state of the document with each update event.
    var options = new ChangeStreamOptions
    {
        FullDocument = ChangeStreamFullDocumentOption.UpdateLookup
    };

    using (var cursor = await coll.WatchAsync(pipeline, options, cancellationToken))
    {
        await cursor.ForEachAsync(async change =>
        {
            // Dump the full document of every observed change as indented JSON.
            var json = change.fullDocument.ToJson(new JsonWriterSettings { Indent = true });
            await file.WriteLineAsync(json);
        }, cancellationToken);
    }
}
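
For reference, ChangeStreamOutputWrapper<T> is a small projection class of my own; it has roughly this shape (the property names mirror the fields selected in the $project stage; uses MongoDB.Bson and MongoDB.Bson.Serialization.Attributes):

public sealed class ChangeStreamOutputWrapper<T>
{
    [BsonElement("_id")]
    public BsonDocument id { get; set; }            // resume token for this event

    [BsonElement("fullDocument")]
    public T fullDocument { get; set; }             // current state of the document

    [BsonElement("ns")]
    public BsonDocument ns { get; set; }            // namespace (database/collection)

    [BsonElement("documentKey")]
    public BsonDocument documentKey { get; set; }   // _id (and shard key) of the changed document
}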

Issue

While observing the output, I noticed that the change feed was not triggered for every update made to the collection. I can confirm this by comparing the output with that generated by the same code against a database hosted by MongoDB Cloud.

Questions

  1. How reliable is change stream support in Azure Cosmos DB’s API for MongoDB?

  2. Can the API guarantee that the most recent update will always be available?

  3. I was not able to process the 'oplog.rs' collection of the 'local' database on my own; does the API support this in any way, and is it even encouraged?

  4. Is the collection throughput (RU/s) in some way related to the change event frequency?

Final thoughts

My understanding is that frequent updates throttle the system and the change feed simply does not capture every event from the log (rather, it scans the log periodically). However, I am wondering how safe it is to rely on such a mechanism and be sure not to miss any critical updates made to the database.

If change feed support cannot make any guarantees regarding event handling frequency and there is no way to process 'oplog.rs', the only option seems to be periodic polling of the database.

Correct me if I am wrong, but switching to polling would greatly affect performance and lead to a solution that does not scale.
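
To be explicit, by polling I mean something like the following, assuming every document carries an application-maintained updatedAt field (the names here are illustrative):

private async Task PollForChangesAsync<T>(IMongoCollection<T> coll, TimeSpan interval, CancellationToken cancellationToken = default)
{
    var lastSeen = DateTime.UtcNow;

    while (!cancellationToken.IsCancellationRequested)
    {
        // 'updatedAt' is assumed to be set by every writer on each update.
        var filter = Builders<T>.Filter.Gt("updatedAt", lastSeen);
        var changed = await coll.Find(filter).ToListAsync(cancellationToken);

        if (changed.Count > 0)
        {
            // Coarse checkpoint; a real implementation would track max(updatedAt).
            lastSeen = DateTime.UtcNow;
            // ... handle the changed documents ...
        }

        await Task.Delay(interval, cancellationToken);
    }
}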


Solution

  • I suspect that the MongoDB change stream is built on the Cosmos DB change feed. My experience is entirely with the Cosmos DB change feed; I haven't used the MongoDB API at all. So this whole answer assumes that the MongoDB change stream internally uses the Cosmos DB change feed, which makes sense, but I could be wrong.

    How reliable is change stream support in Azure Cosmos DB’s API for MongoDB?

    It's fully reliable, but has some limitations.

    One of the change feed limitations is that it can "batch" updates. Internally, the change feed processor polls the change feed, and it will get all items that have changed. However, if an item changes multiple times between polls, it will only show up in the change feed once. This is the behavior of the Cosmos DB SQL API Change Feed, and I expect the same limitation applies to the MongoDB change stream, though I don't see it actually documented anywhere in the MongoDB docs.

    Another limitation is that deletes are not observed.

    Because of these limitations, the change feed / change stream is not an event sourcing solution. If you want event sourcing, then you'll need to model your data as events yourself; there's nothing built-in that will do that for you.

    That said, within these limitations, it's fully reliable in the sense that your code will receive every changed document in the change feed. The limitations just mean that multiple updates may come across as a single changed document, and deleted documents do not come across at all.

    Can the API guarantee that the most recent update will always be available?

    There's always the chance that the document has changed after your code retrieved the document from the change feed, in which case the updated document will be re-published to the change feed and your code will see it again in a bit. There's no guarantee (of course) that the document your code just got from the change feed is the same as what's in the db, but it will be eventually consistent.
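
    If a handler must act on the truly latest state, one option is to re-read the document by its documentKey at processing time instead of trusting the event's fullDocument. A minimal sketch (the helper name is mine):

        // Hedged sketch: re-fetch the current state of a changed document rather
        // than relying on the (possibly stale) fullDocument from the change event.
        private static Task<T> GetLatestAsync<T>(
            IMongoCollection<T> coll,
            BsonDocument documentKey,               // _id (and shard key) of the changed document
            CancellationToken cancellationToken = default)
        {
            var filter = new BsonDocumentFilterDefinition<T>(documentKey);
            return coll.Find(filter).FirstOrDefaultAsync(cancellationToken);
        }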

    I was not able to process the 'oplog.rs' collection of the 'local' database on my own, does the API support this in any way? Is this even encouraged?

    ¯\_(ツ)_/¯

    Is the collection throughput (RU/s) in some way related to the change event frequency?

    Yes. The change feed itself is built-in to Cosmos DB, but the change feed processing has an RU cost. RUs are used by the change feed processor to poll the change feed, read documents from the change feed, and also update its "bookmark" to keep track of where in the change feed it is.
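
    On the MongoDB side, the closest analog to that bookmark is the change stream resume token, which you can persist yourself so that a restarted watcher picks up where it left off. A rough sketch, assuming a 'checkpoints' collection of my own invention for storing the token (written in plain driver terms; against Cosmos DB you would keep the same $match/$project pipeline as in the question):

        // Hedged sketch: persist the resume token after each processed change so a
        // restarted watcher resumes from its last position instead of starting over.
        private async Task WatchWithCheckpointAsync<T>(
            IMongoCollection<T> coll,
            IMongoCollection<BsonDocument> checkpoints,   // hypothetical bookmark store
            CancellationToken cancellationToken = default)
        {
            // Load the last saved resume token, if any.
            var saved = await checkpoints.Find(FilterDefinition<BsonDocument>.Empty)
                                         .FirstOrDefaultAsync(cancellationToken);

            var options = new ChangeStreamOptions
            {
                FullDocument = ChangeStreamFullDocumentOption.UpdateLookup,
                ResumeAfter = saved?["token"].AsBsonDocument
            };

            using (var cursor = await coll.WatchAsync(options, cancellationToken))
            {
                await cursor.ForEachAsync(async change =>
                {
                    // ... handle the change ...

                    // Save the token identifying this event (the "bookmark").
                    await checkpoints.ReplaceOneAsync(
                        FilterDefinition<BsonDocument>.Empty,
                        new BsonDocument("token", change.ResumeToken),
                        new ReplaceOptions { IsUpsert = true },
                        cancellationToken);
                }, cancellationToken);
            }
        }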

    My understanding is that frequent updates throttle the system and the change feed simply does not capture every event from the log (rather, it scans the log periodically).

    That is correct.

    However, I am wondering how safe it is to rely on such a mechanism and be sure not to miss any critical updates made to the database.

    The code will always (eventually) receive the updated documents. However, if you need to see each change individually, then you will need to structure your data using something like event sourcing. If your app only cares about the final state of the documents, then the change feed is fine. But if, e.g., you need to know if someCriticalProperty was set to true and then back to false, then you'll need event sourcing.
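
    As a sketch of what that modeling could look like (all names here are illustrative, not an existing API): instead of updating a document in place, append an immutable event document per change. Each change is then an insert of a distinct document, so the change feed cannot coalesce two changes into one.

        // Hedged sketch: append-only event documents, so no intermediate state
        // (e.g. someCriticalProperty flipping to true and back to false) is lost.
        public sealed class CriticalPropertyChanged      // hypothetical event type
        {
            public ObjectId Id { get; set; }             // generated by the driver
            public string EntityId { get; set; }
            public bool SomeCriticalProperty { get; set; }
            public DateTime OccurredAtUtc { get; set; }
        }

        private static Task AppendEventAsync(
            IMongoCollection<CriticalPropertyChanged> events,
            string entityId, bool value,
            CancellationToken cancellationToken = default)
        {
            // Never update, only insert: every change survives between feed polls.
            return events.InsertOneAsync(new CriticalPropertyChanged
            {
                EntityId = entityId,
                SomeCriticalProperty = value,
                OccurredAtUtc = DateTime.UtcNow
            }, cancellationToken: cancellationToken);
        }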

    switching to polling would greatly affect performance and lead to a solution that does not scale.

    Polling isn't necessarily bad. The change feed processor itself uses polling, as described above. It also has a neat mechanism for scale-out, where different processor instances watching the same collection split the documents up between them (by partition key). I'm not sure if or how this translates to the MongoDB world, but it's a pretty elegant way to scale SQL API change feed processors, and it works quite nicely with Azure Functions (unfortunately, there's no MongoDB change stream trigger for Azure Functions).
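
    For illustration, here is roughly what that scale-out looks like on the SQL API side with the v3 .NET SDK (Microsoft.Azure.Cosmos); the database, container, and processor names are placeholders:

        // Hedged sketch (SQL API, Microsoft.Azure.Cosmos v3): each named instance
        // claims a share of the partition key ranges via leases in the lease
        // container, so adding instances scales the processing out.
        using Microsoft.Azure.Cosmos;

        var client = new CosmosClient(connectionString);          // placeholder
        Container monitored = client.GetContainer("mydb", "items");
        Container leases = client.GetContainer("mydb", "leases");

        ChangeFeedProcessor processor = monitored
            .GetChangeFeedProcessorBuilder<dynamic>("myProcessor",
                (changes, token) =>
                {
                    foreach (var doc in changes)
                    {
                        // handle each changed document (latest version only)
                    }
                    return Task.CompletedTask;
                })
            .WithInstanceName(Environment.MachineName)            // unique per instance
            .WithLeaseContainer(leases)
            .Build();

        await processor.StartAsync();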