airflow, etl, large-data, elt

Can Apache Airflow handle an ETL process for approximately 50TB of data?


I have around 50TB of data in an on-premises RDBMS that needs to be extracted from this source database and loaded into a downstream system. I would like to use Apache Airflow to schedule this ETL process.

Can Airflow be used to schedule an ETL process that transfers ~50TB of data from a source system to a destination system?


Solution

  • The short answer is "it depends".

    Airflow strictly as an Orchestrator

    Airflow is best known for its role as an "orchestrator". In this role, Airflow defines a schedule on which tasks trigger work on third-party systems. For example, Airflow can trigger a Spark job every day at midnight; that Spark job might run on a Databricks cluster, on a managed Spark service such as Amazon EMR or Google Dataproc, or on an on-premises Spark cluster.

    A solution like this lets you offload large ETL tasks to tools purpose-built for this kind of work, and from Airflow's perspective the heavy lifting happens asynchronously on the external system.
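    To illustrate this orchestration-only pattern, here is a minimal sketch of a DAG that does nothing but submit a nightly Spark job to an external cluster. It assumes the apache-airflow-providers-apache-spark package is installed; the application path, connection id, and Spark conf values are placeholders, not recommendations.

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="nightly_etl_trigger",
        start_date=datetime(2024, 1, 1),
        schedule="0 0 * * *",  # every day at midnight
        catchup=False,
    ) as dag:
        # Airflow only submits the job; the heavy ETL work runs on the Spark cluster.
        run_spark_etl = SparkSubmitOperator(
            task_id="run_spark_etl",
            application="/jobs/extract_and_load.py",  # hypothetical job script
            conn_id="spark_default",                  # connection to your Spark cluster
            conf={"spark.executor.memory": "8g"},     # illustrative tuning only
        )
    ```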

    Executing work on Airflow

    Actual ETL work can also be performed on Airflow workers, and sometimes that's entirely appropriate. Whether it makes sense to execute this data transfer on Airflow workers depends on several factors, including:

    1. Do you need to transfer 50TB every day?
    2. How quickly does the 50TB need to be transferred?
    3. How much disk space or memory is available on your Airflow workers?

    If you only need to transfer 50TB of data once and then never again, it doesn't really make sense to use Airflow unless you already have an existing Airflow environment with the capacity to execute such a transfer.

    If you need to execute such a transfer regularly and Airflow has sufficient compute resources to move the data in an acceptable amount of time, then sure, you can use Airflow. For example, if you need to back up a 50TB data warehouse table, you could capture the record count and use dynamic task mapping to iterate over chunks of the table, writing each chunk to cloud storage in Parquet format, as in the sketch below.
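    To make the dynamic task mapping idea concrete, here is a hedged sketch of that chunked-backup pattern: the row count, chunk size, and bucket path are hypothetical, and the export step is a placeholder for the real query-and-write logic.

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.decorators import task

    CHUNK_SIZE = 10_000_000  # rows per mapped task; tune to worker capacity

    with DAG(
        dag_id="warehouse_table_backup",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually or on whatever cadence you need
        catchup=False,
    ) as dag:

        @task
        def get_chunk_offsets() -> list[int]:
            # In practice, query the source table's row count via a DB hook;
            # hard-coded here to keep the sketch self-contained.
            total_rows = 50_000_000
            return list(range(0, total_rows, CHUNK_SIZE))

        @task
        def export_chunk(offset: int) -> str:
            # Placeholder for: SELECT a chunk of rows starting at `offset`,
            # then write the result to cloud storage as a Parquet file.
            target = f"s3://my-backup-bucket/big_table/part_{offset}.parquet"
            print(f"Exporting rows {offset}-{offset + CHUNK_SIZE} to {target}")
            return target

        # Dynamic task mapping: one mapped task instance per chunk offset.
        export_chunk.expand(offset=get_chunk_offsets())
    ```

    Each mapped task instance handles one chunk, so the transfer parallelizes across whatever worker capacity your Airflow environment has.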

    50TB is quite a lot, though, and it may be more cost- and time-efficient to hand the work to a distributed computing engine optimized for this scale, such as Spark.