apache-kafka, design-patterns, architecture, system-design

Handling Multi-Table Updates in Kafka


We have an application that serves as a configuration repository, storing data in a relational database. Whenever a user changes a configuration item, it is persisted to the database, and our goal is to communicate these changes to subscribers via Kafka messages.

In the web UI, users can modify multiple sections of the configuration, often affecting multiple tables. After initial research, I understand that the typical way to model relational DB-like applications is by using one Kafka topic per table. However, this approach triggers multiple Kafka messages to different topics when a user modifies multiple tables.

How can consumers know how long to wait until all modifications are complete before reacting to the changes? Or can you suggest an alternative way to model something similar?


Solution

  • After initial research, I understand that the typical way to model relational DB-like applications is by using one Kafka topic per table

    It depends on the actual use case. It is perfectly fine to have a single topic that carries all the changes grouped into a single message, if that fits your needs. I.e. the message would look like the following:

    {
       "user":"...",
       "changes":[
          {
             "entity":"one...",
              ... // changes
          },
          {
             "entity":"two...",
              ... // changes
          }
       ]
    }
    

    Note that if a single operation can produce a lot of changes, you might run into the Kafka message size limit (roughly 1 MB by default, controlled by the broker's message.max.bytes setting).
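
    As a rough sketch, publishing such a combined message could look like the Java producer snippet below (the topic name config-changes and the hardcoded payload are just placeholders for illustration):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ConfigChangePublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // One message carries the whole set of changes made by the user.
            String payload = "{\"user\":\"alice\",\"changes\":[{\"entity\":\"one\"},{\"entity\":\"two\"}]}";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keying by user (or by configuration id) keeps related changes on one
                // partition, so consumers see them in order.
                producer.send(new ProducerRecord<>("config-changes", "alice", payload));
            }
        }
    }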

    Another approach is to still have a single topic but publish one message per table, adding metadata that carries a unique "transaction"/operation id and the total number of changes in that "transaction"/operation:

    {
       "user":"...",
       "metadata":{
          "transaction":"unique_tran_id",
          "operationNumber":"unique_operation_in_tran_id",
          "changesInTransaction":"total_num_of_changes"
       },
       "entity":"",
       ... // changes
    }
    

    Then you can use a saga-like approach where the processor accumulates and counts the processed operations and decides when processing is complete (using the count of processed items plus deduplication based on the operationNumber), as sketched below. This approach can also span multiple topics if needed. One note though: IMO this approach is best coupled with the transactional outbox pattern, so you minimize the chance that only some of a "transaction"/operation's messages get published (in which case processing of the "transaction"/operation would never complete).
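
    A minimal sketch of that accumulation logic (class and method names here are made up for illustration; in practice the buffered state would also need to survive consumer restarts, e.g. by being persisted or rebuilt from the topic):

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Buffers the per-table messages of a "transaction" until all of them have arrived.
    public class TransactionAccumulator {

        // transactionId -> (operationNumber -> payload); keying by operationNumber deduplicates redeliveries.
        private final Map<String, Map<String, String>> buffers = new HashMap<>();

        public Optional<Collection<String>> accept(String transactionId,
                                                   String operationNumber,
                                                   int changesInTransaction,
                                                   String payload) {
            Map<String, String> ops = buffers.computeIfAbsent(transactionId, id -> new HashMap<>());
            ops.put(operationNumber, payload); // a duplicate overwrites instead of being counted twice

            if (ops.size() == changesInTransaction) {
                buffers.remove(transactionId);    // the "transaction" is complete
                return Optional.of(ops.values()); // hand the full change set to downstream processing
            }
            return Optional.empty();              // still waiting for the remaining operations
        }
    }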

    Also, keep in mind that you are not limited to a single output Kafka topic. There are setups with separate "business" and "technical" output queues: for example, a topic per table as the "technical" ones, plus a "business" topic using one of the first two approaches; a sketch of that fan-out follows below.
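
    In this sketch the topic names and the TableChange record are hypothetical; it only illustrates publishing the same change set to both kinds of topics:

    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Publishes every change to a per-table "technical" topic, and the whole
    // change set as one message to a "business" topic.
    public class DualTopicPublisher {

        public record TableChange(String table, String json) {}

        private final KafkaProducer<String, String> producer;

        public DualTopicPublisher(Properties props) {
            this.producer = new KafkaProducer<>(props);
        }

        public void publish(String user, List<TableChange> changes, String businessPayload) {
            for (TableChange change : changes) {
                // "technical" topic per table, e.g. config.table_a, config.table_b
                producer.send(new ProducerRecord<>("config." + change.table(), user, change.json()));
            }
            // single "business" message describing the whole operation
            producer.send(new ProducerRecord<>("config-business-events", user, businessPayload));
        }
    }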