airflow etl camunda data-pipeline

Camunda as scheduler and orchestrator of data-pipeline / ETL


I would like to know if anyone has implemented Camunda as a scheduler and orchestrator of data pipelines/ETL and can share their experience.

What are the pros and cons of using it instead of Airflow, for example?

Thanks!


Solution

  • Camunda

    Camunda does not offer connectors (such as S3, databases, MongoDB, RabbitMQ, Kafka, Power BI), which makes it a weak candidate for ETL. One may say that you can use custom processors, and that is true, but you need to write Java for those to achieve ETL. I found it suitable for modeling human-in-the-loop decision processes.

    Apache Airflow

    I have run numerous experiments in Apache Airflow (https://github.com/kurtzace/airflow-experiments). It handles DAGs well and has numerous ready-to-use connectors, though of course you need a little bit of Python. Using Spiff, we can achieve BPMN-type experiments; this needs less code than Camunda or plain Apache Airflow.

    Cons: steep learning curve; mostly used for data science pipelines.
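To make the "DAG of tasks" idea concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: it only illustrates the concept an orchestrator such as Airflow implements, namely running ETL steps in dependency order and passing results downstream. The task names, the sample rows, and the `run_pipeline` helper are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    # Hypothetical source rows; a real pipeline would read from S3, a database, etc.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"}]


def transform(rows):
    # Convert the string amounts to floats.
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]


def load(rows):
    # Stand-in for writing to a target system: just return the total amount.
    return sum(r["amount"] for r in rows)


# Each task name maps to (callable, list of upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run every task after its upstream tasks, feeding it their results."""
    # TopologicalSorter expects {node: set of predecessors}.
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(TASKS)
    print(order)
    print(results["load"])
```

In Airflow itself, the same shape would be declared as operators inside a DAG, with scheduling, retries, and connectors provided by the framework rather than written by hand.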

    Apache Nifi

    At the other extreme, I found Apache NiFi to be better suited for ETL. It needs less code by comparison and has many prebuilt processors, such as batch/file, HTTP/HTTPS/REST, S3, JSON transformers, CSV transformers, DB connectivity, concat, merge, and filter.

    Cons: NiFi is not a good fit when (a) processing takes more than 15 minutes, (b) it has to behave like a Spark-style distributed compute engine, (c) data volumes exceed 1 GB per connection, (d) you need complex joins or rolling windows, or (e) you need RabbitMQ-type eventing.