Tags: amazon-s3, microservices, amazon-sns, cqrs, event-sourcing

Microservices + CQRS implementation


I am working on implementing a microservice architecture using the CQRS pattern. I have a working implementation using API Gateway, Lambda, and DynamoDB, with one exception - the event sourcing.

Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume. This notification represents an event that took place as part of the originating HTTP request. For instance, if the user makes an HTTP POST with a complete "check patient into hospital" model, then the Lambda will break that apart and publish multiple events in sequential order (a sketch follows the list):

Patient Checked In (includes patient id, hospital id + visit id)

Room Assigned (includes room number + visit id)

Patient Tested (includes test details + visit id)

Patient Checked Out (includes visit id)
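
Roughly, the decomposition step I have in mind looks like the sketch below. The topic ARN and the event shapes are placeholders, not my real model:

```python
# Sketch: break one "check patient into hospital" model into ordered events.
# EVENT_TOPIC_ARN and the event shapes are hypothetical placeholders.
import json
import boto3

sns = boto3.client("sns")
EVENT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:visit-events"  # placeholder

def publish_visit_events(visit):
    events = [
        {"type": "PatientCheckedIn", "patientId": visit["patientId"],
         "hospitalId": visit["hospitalId"], "visitId": visit["visitId"]},
        {"type": "RoomAssigned", "roomNumber": visit["roomNumber"],
         "visitId": visit["visitId"]},
        {"type": "PatientCheckedOut", "visitId": visit["visitId"]},
    ]
    for sequence, event in enumerate(events):
        event["sequence"] = sequence  # recorded so replays can preserve order
        sns.publish(TopicArn=EVENT_TOPIC_ARN, Message=json.dumps(event))
```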

The intent of this pattern is to provide an audit trail of every event that took place while the patient was in the hospital. In this example (not what I'm actually building), the events would be stored in an event source that can be replayed at any time: if the VisitId were deleted across all services, we could replay the events one at a time, in order, and reproduce an exact copy of the original record. To achieve this, all records are treated as immutable. Each POST would push into the event source and then land in the database that serves the data during an HTTP GET request. There would also be subscribers that take pieces of this data and do other things - such as a "Visit Survey" service that listens for the Patient Checked Out event and preps a post-op survey.
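
Conceptually the replay is just a fold over the ordered events back into a record. A sketch, with event shapes matching the placeholders above:

```python
# Sketch: rebuild the record by applying the immutable events in sequence order.
def replay(events):
    record = {}
    for event in sorted(events, key=lambda e: e["sequence"]):
        if event["type"] == "PatientCheckedIn":
            record.update(patientId=event["patientId"],
                          hospitalId=event["hospitalId"],
                          visitId=event["visitId"])
        elif event["type"] == "RoomAssigned":
            record["roomNumber"] = event["roomNumber"]
        elif event["type"] == "PatientCheckedOut":
            record["checkedOut"] = True
    return record
```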

I've looked at several AWS services to provide this. I know about Kinesis Data Streams, but I don't like the pricing structure, nor do I want to deal with shards (there's no autoscaling). Since my entire platform is built on consumption-based pricing (DynamoDB, Lambda, etc.), I want to keep my event source the same way. That makes it easier to estimate a per-user cost, since I can just do the math from estimated requests per month, per user.

I've been using SNS for the stream itself, delivering the notifications, and it's been great: super fast, with no major issues while developing against it. The problem is that SNS is not suitable as a replay store - it only delivers the event messages. For a replay store I thought Kinesis Data Firehose made a lot of sense: send to S3 and SNS at the same time. It turns out SNS isn't an available delivery destination. I could PutObject to S3 myself and then publish to SNS, but that seems like duplicate work in the code base when I can set up an S3 trigger to fire a Lambda - a small Lambda that reacts to the event landing in S3 and does the insert into DynamoDB. I've seen that this can be much slower than publishing through SNS directly, though, and I'm also not sure about retry policies on the Put event. It does simplify retries, however, as I can just reuse the code in the triggered Lambda to replay all events in a bucket path.
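
The triggered Lambda would be something like the sketch below - the table name and bucket layout are placeholders, and the handler signature is the standard one for S3 event notifications:

```python
# Sketch: react to an event object landing in S3 and insert it into DynamoDB.
# "VisitEvents" is a placeholder table; events are stored as JSON documents.
import json
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("VisitEvents")  # placeholder name

def handler(event, context):
    for record in event["Records"]:  # one entry per object-created notification
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        item = json.loads(body, parse_float=Decimal)  # DynamoDB rejects floats
        table.put_item(Item=item)
```

Replaying a bucket path is then just listing the keys under a prefix and running the same code over them.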

I could just PutObject and then Publish to SNS within the same HTTP POST Lambda. If the SNS Publish fails, though, I'd have an object in S3 that was never published, and I'd have to write a different Lambda to handle fixing and re-publishing it. Not the end of the world - either way I have two Lambdas to deploy. I'm just not sure which approach makes more sense in this pattern with AWS services.
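
That version would look roughly like the sketch below (bucket, topic, and key scheme are placeholders). The failure mode I'm worried about is the publish call raising after the put has already succeeded:

```python
# Sketch: write the immutable event to S3, then announce it on SNS.
# If sns.publish fails after put_object succeeds, the object is stored
# but never announced - hence the separate "fixer" Lambda.
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
BUCKET = "visit-event-store"                                   # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:visit-events"  # placeholder

def handler(event, context):
    body = json.loads(event["body"])  # API Gateway proxy integration event
    key = f"{body['visitId']}/{body['sequence']}.json"
    payload = json.dumps(body)
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    sns.publish(TopicArn=TOPIC_ARN, Message=payload)
    return {"statusCode": 202}
```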

Has anyone done something similar and have any recommendations? Am I working my way into a technical hole that will be difficult to manage later? I'm open to other paths as well if I can keep it to a consumption based pricing model. Thanks!


Solution

  • Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume.

    You'll want to be a little bit careful here -- there are at least two different definitions of "event sourcing" running around.

    If you care about event sourcing, in the sense usually coupled with CQRS (Greg Young, et al), then your events are your book of record. The important complication this introduces is that your service needs to be able to lock the "event stream" when making changes to it (without that lock, you run into "lost edit" scenarios and have to clean up the mess).

    So the "pointer to your current changes" needs to live in something that has transactions. DynamoDB should be fine for this (based on my memory of the event sourcing break out room at re:Invent 2017). In theory, you could have the lock in dynamo, which contains a pointer to an immutable document stored in S3. I haven't been able to persuade myself that the trade offs justify the complexity, but as best I can tell there's nothing in that architecture that violates physics and causality.

    If your operations team isn't happy with Dynamo, another reasonable option is RDS; choose your preferred relational data engine, deploy an event storage schema to it, and off you go.

    As for the pub sub part, I believe you to be on the right track with SNS. It's the right choice for "fanning out" messages from a publisher to multiple consumers. Yes, it doesn't support replay, but that's fine -- replay can happen by pulling events from the book of record. See the later parts of Greg Young's Polyglot Data talk. Yes, sometimes you will get messages on both the push channel and the pull channel, but that's fine; you already signed up for idempotent message handling when you decided a distributed architecture was a good idea.
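
    The idempotency usually reduces to remembering which event ids you've already processed. A sketch, with hypothetical names, using a conditional put as the deduplication gate:

    ```python
    # Sketch: process each event at most once, whether it arrives via the
    # push (SNS) or pull (book of record) channel. Names are hypothetical.
    import boto3
    from botocore.exceptions import ClientError

    processed = boto3.resource("dynamodb").Table("ProcessedEvents")  # placeholder

    def handle_once(event_id, apply_effect):
        try:
            processed.put_item(
                Item={"eventId": event_id},
                ConditionExpression="attribute_not_exists(eventId)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return  # duplicate delivery; already handled
            raise
        apply_effect()  # the consumer's actual reaction to the event
    ```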

    Edit

    Why the need to store a pointer in DynamoDB?

    Because S3 doesn't offer you any locking, which means that on the unhappy path, where two copies of your logic are trying to write different versions of your data, you end up a victim of the lost edit problem.

    You could manage the situation with optimistic locking - something analogous to HTTP's conditional PUT; but S3 (last time I checked) doesn't support conditional modification.

    You could use S3 as an object store for immutable documents, but now you need some mechanism to determine which document in S3 is the "current" one. If you try to implement that in S3, you run into the same lost edit problem all over again.

    So you need a different tool to handle that part of the problem - some tool that is suitable for "state succession" - and DynamoDB fits there.

    If you are using DynamoDB for locking, can you also use it for event storage? I don't have enough laps to feel confident that I know the answer there. For small problems, I'm mostly confident that the answer is yes. For large problems...?
