sql-serverapache-kafkaapache-kafka-connectdebeziumcdc

How to increase debezium / kafka connect performance for initial snapshot of millions of records and enable snapshot parallely if possible?


Usecase: I have 700+ tables in sql server database and have high volume of data in each table. Each table is having 20-50 millions of records and I need to run debezium on all tables for initaial snapshot and push them to kafka.

Tools used:

  1. Kafka 3.3.1
  2. Debezium 2.0
  3. Apicurio regitry
  4. Avro convertor

Analysis: Taking snapshot of 1 table with 50 million records almost takes 6 hours to get complete. FYI, I have set following properties set and have tried various values for them

offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=20485760
max.batch.size=30000
max.queue.size=200000

But performance is not increasing.

Problem: Considering many tables and since snapshot works sequentially, It will take days to take initial snapshot, how can I resolve this problem?

TIA


Solution

  • Starting from Debezium 2.2.0, it is possible to snapshot multiple tables in parallel, as explained in their blog post.

    To define how many tables should be snapshotted in parallel, you should set the following parameter in the Debezium connector config:

    snapshot.max.threads=4
    

    If this value is greater than 1, then snapshot parallelism will be enabled.