I am trying to build an ETL that map the source tables to a dimensional, star schema model
our data warehouse is basically Impala on top of Kudu database
my question is, should I:
A- build an ETL that deals with kudu tables directly using Python (link)
or
B- or create UDFs (equivalent to stored procedures in SQL) in impala that does the insertion/joins etc to map source tables to star-schema model, and schedule it using Nifi or any scheduler such as Airflow etc
In my opinion, I think it would be better to deal with the native database rather than dealing with the SQL engine on top of it. but it is just an assumption.
Why not approach C, :) a bit of both.
Both has pros and cons.
A - use python to build ETL - pros - better control, flexible to do any logic you want. cons - you have to code in python and code in sql. If something fails, it will be a nightmare to do RCA. Maintenance may be harder in comparison. - performance wise, this approach will be poorer in case of huge volume of data.
B - Use SQL to fetch data directly - pros - faster performance. less coding. cons - difficult to implement complex logic. Maintenance of code and schedule may be hard.
In addition to above, pls consider, your/teams comfort on python/SQL and future maintainability.
Currently we are using approach B in my cloudera project. We create views and then use insert
to load final tables directly. We hardly need any UDF.
Now, my recommendation, please use approach B. And use approach A only in case you really can not create complex logic.
EDIT : Lets say, we have to load orders table. So we execute following blocks to load orders and dependent org,cust,prod tables.
Load customer |
load org | --> Load Orders final.
load product |
load order stage|
Load customer block is collection of scripts like-
insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table
And similarly other blocks are written. Putting them in blocks gives us flexibility to put them in any order/anywhere we want to improve performance.