I have a few large monolithic databases and applications in my organisation. New feature requests mainly get built within those monoliths. I am exploring options to change this trend by making data, and changes to that data, available as events in topics for new microservices to consume.
This diagram (Event streaming design) shows one part of the design: change data capture (CDC) for each table in the source database sends changes from each table to a topic, and a stream processor then wrangles this data into meaningful events for microservices to consume. For the streaming layer, I am considering options such as Kafka Streams, Apache Flink, Apache Beam / Google Cloud Dataflow, or Akka Streams.
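To make that concrete, here is a very rough sketch of the kind of job I imagine in the streaming layer, assuming Flink were the choice; the topic names (cdc.orders, events.order-placed) and the toOrderPlacedEvent method are made up for illustration:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrdersCdcToEvents {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read the raw per-table CDC topic
        KafkaSource<String> cdcSource = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("cdc.orders")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> changes =
                env.fromSource(cdcSource, WatermarkStrategy.noWatermarks(), "orders-cdc");

        // Wrangle each raw change record into a meaningful business event
        DataStream<String> orderEvents = changes.map(OrdersCdcToEvents::toOrderPlacedEvent);

        // Publish the events to a topic that microservices subscribe to
        orderEvents.sinkTo(KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("events.order-placed")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build());

        env.execute("orders-cdc-to-events");
    }

    // Placeholder for the real wrangling logic (parse the CDC payload, enrich, rename fields, etc.)
    private static String toOrderPlacedEvent(String rawChange) {
        return rawChange;
    }
}
```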
My questions are below (I have not worked on complex stream processing). I have tried some basic stream processing and it works great, but the complexity increases when dealing with multiple streams. Thanks for your help.
For context, I've worked with nearly all of the streaming technologies that you've mentioned in your post, and I've found Apache Flink to be far and away the easiest to work with (although it's worth mentioning that just about all of them can suit your use case).
1. For streams to work on complex joins and lookups, I need to have a snapshot persisted, at least for static data. How do I do this? What options do I have with each of the choices above?
2. Is this pattern common? What are some of the challenges?
Most change data capture (CDC) streams that you might use to populate your join values support the notion of a snapshot. Generally, when you make major changes to the data you are syncing, add new sources, etc., it can be useful to perform a one-time snapshot that loads all of that data into something like a Kafka topic.
Once it's there, you can use a consumer to decide how you want to read it (e.g. from the beginning of the topic), and you'll also be able to listen for new changes as they come in to ensure they are reflected in your stream as well.
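As a rough illustration (the broker address, topic, and group id below are made up), a plain Kafka consumer in a brand-new consumer group with auto.offset.reset=earliest will first replay the snapshot records from the beginning of the topic and then keep receiving the ongoing changes:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SnapshotReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "customer-lookup-loader");
        // A new group with "earliest" starts at the beginning of the topic,
        // so the snapshot is read first and new CDC changes follow.
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("cdc.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Upsert into whatever lookup store (or state) your stream job uses
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```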
One of the key differences in something like Apache Flink is its use of "state" for performing these types of operations. In Flink, you'd read this "lookup" stream of snapshots/changes, and somewhere in your streaming job you could store those values in state, which wouldn't require re-reading them each time. As the values changed, you could update the state so that each message coming through your pipeline would simply look up the corresponding value and use it to perform the "join". State itself is fault-tolerant and would be persisted through restarts of the streaming job, failures, etc.
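A minimal sketch of that idea, assuming Flink's keyed state and hypothetical Order/Customer types standing in for your real payloads: the lookup/CDC stream writes into state, and the main stream reads from it to perform the join.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class CustomerEnricher
        extends RichCoFlatMapFunction<CustomerEnricher.Order, CustomerEnricher.Customer, String> {

    // Hypothetical stand-ins for your real CDC/event payloads
    public static class Order { public String customerId; public String payload; }
    public static class Customer { public String id; public String name; }

    private transient ValueState<Customer> customerState;

    @Override
    public void open(Configuration parameters) {
        // Keyed, fault-tolerant state: persisted via checkpoints and restored on restart
        customerState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("customer", Customer.class));
    }

    // Main stream: each order looks up the latest customer snapshot held in state (the "join")
    @Override
    public void flatMap1(Order order, Collector<String> out) throws Exception {
        Customer customer = customerState.value();
        if (customer != null) {
            out.collect(order.payload + " enriched with customer " + customer.name);
        }
        // else: decide whether to buffer or drop orders whose customer hasn't been seen yet
    }

    // Lookup/CDC stream: every change simply overwrites the state for that key
    @Override
    public void flatMap2(Customer customer, Collector<String> out) throws Exception {
        customerState.update(customer);
    }
}
```

You'd apply it to two keyed streams with something like `orders.connect(customers).keyBy(o -> o.customerId, c -> c.id).flatMap(new CustomerEnricher())`, so the state is partitioned by customer id and checkpointed by Flink; the same idea also works with `KeyedCoProcessFunction` or Flink's SQL temporal joins.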
3. It looks like Kafka is a good choice; will you be able to share your view?
Generally speaking, Kafka is probably the most ubiquitous technology for handling message processing in the streaming world. There are plenty of other options out there, but you really can't go wrong with using Kafka.
It's a good, and very likely the best, choice (although your mileage may vary).