impala, kudu, apache-kudu

ETL choice: build an ETL that works through the SQL query engine (Impala), or against the native database directly?


I am trying to build an ETL that maps the source tables to a dimensional, star-schema model.

Our data warehouse is basically Impala on top of a Kudu database.

My question is, should I:

A - build an ETL that deals with the Kudu tables directly using Python (link), as sketched after the options below

or

B - create UDFs in Impala (the equivalent of stored procedures in SQL) that do the inserts/joins etc. to map the source tables to the star-schema model, and schedule them using NiFi or any scheduler such as Airflow?
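
For reference, a minimal sketch of what option A could look like with the kudu-python client is below; the master host and the src_customer/dim_customer table and column names are hypothetical placeholders, not our actual schema.

    # Sketch of option A: move rows from a source Kudu table into a
    # dimension table using the kudu-python client. Host, port, table
    # and column names are made up for illustration only.
    import kudu

    client = kudu.connect(host='kudu-master.example.com', port=7051)

    src = client.table('src_customer')   # hypothetical source table
    dim = client.table('dim_customer')   # hypothetical dimension table

    session = client.new_session()

    # Read the source rows and upsert them into the dimension table;
    # any mapping/derivation logic would live in this Python loop.
    scanner = src.scanner()
    for row in scanner.open().read_all_tuples():
        op = dim.new_upsert({'customer_id': row[0], 'customer_name': row[1]})
        session.apply(op)

    session.flush()

With this approach all of the mapping logic sits in Python code rather than in Impala SQL, which is exactly the trade-off between the two options.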

In my opinion, it would be better to deal with the native database rather than with the SQL engine on top of it, but that is just an assumption.


Solution

  • Why not approach C :) a bit of both.

    Both have pros and cons.

    In addition to the above, please consider your/your team's comfort with Python/SQL and future maintainability.
    Currently we are using approach B in my Cloudera project. We create views and then use INSERT statements to load the final tables directly. We hardly need any UDFs.
    Now, my recommendation: use approach B, and use approach A only if you really cannot express the complex logic in SQL.

    EDIT: Let's say we have to load the orders table. We execute the following blocks to load orders and the dependent org, cust, and prod tables.

    Load customer    |
    Load org         | --> Load orders final
    Load product     |
    Load order stage |
    

    The 'Load customer' block is a collection of scripts like:

    insert overwrite cust_stg select * from cust_stg_vw; -- This loads the stage table
    insert overwrite cust select * from cust_vw; -- This loads the final cust table
    

    The other blocks are written similarly. Organising the scripts into blocks gives us the flexibility to run them in any order, or anywhere we want, to improve performance.
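
    If Airflow ends up being the scheduler (as mentioned in the question), one possible way to wire these blocks together is a DAG in which each block is a task and the final orders load waits for the four upstream blocks. The sketch below is only an illustration for Airflow 2.x: it runs each block through the impyla DB-API client with a plain PythonOperator, and the Impala host/port plus the view and table names for the org, product and orders blocks are made-up placeholders following the cust_stg/cust_vw naming above.

    # Rough sketch (Airflow 2.x): schedule the load blocks, running each
    # block's statements against Impala via the impyla DB-API client.
    # Host/port and most table/view names are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from impala.dbapi import connect


    def run_block(statements):
        """Execute one block's SQL statements against Impala, in order."""
        conn = connect(host='impala-coordinator.example.com', port=21050)
        try:
            cur = conn.cursor()
            for sql in statements:
                cur.execute(sql)
            cur.close()
        finally:
            conn.close()


    with DAG(
        dag_id='load_orders_star_schema',
        start_date=datetime(2024, 1, 1),
        schedule_interval='@daily',
        catchup=False,
    ) as dag:

        # One task per block; the statements mirror the blocks above.
        block_sql = {
            'load_customer': [
                "insert overwrite cust_stg select * from cust_stg_vw",
                "insert overwrite cust select * from cust_vw",
            ],
            'load_org': ["insert overwrite org select * from org_vw"],
            'load_product': ["insert overwrite prod select * from prod_vw"],
            'load_order_stage': ["insert overwrite orders_stg select * from orders_stg_vw"],
        }

        upstream = [
            PythonOperator(task_id=name, python_callable=run_block, op_args=[stmts])
            for name, stmts in block_sql.items()
        ]

        load_orders_final = PythonOperator(
            task_id='load_orders_final',
            python_callable=run_block,
            op_args=[["insert overwrite orders select * from orders_vw"]],
        )

        # The four blocks can run in parallel; the final load waits for all of them.
        upstream >> load_orders_final

    This keeps the transformation logic in Impala SQL (approach B) and uses Python only for orchestration, which matches the recommendation above.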