data-warehousesnowflake-cloud-data-platformdata-lineagedbt

What are the options when it comes to handling Data Lineage in Snowflake?


Any ideas/options about handling Data Lineage in Snowflake? We are following a microservice architecture in which we are running a set of stored procedures that contain quite a few SQL queries as soon as certain events are triggered.

Example: When Table A is populated execute SP_Populate_Table_B and the result is that Table B is populated. We have a big set of SPs as we are populating the Staging Area, DataVault and our Dimensional Model.

We are in the lookout for any good way of handling all the metadata around this microservice way of performing our ETL. Basically automated way to track dependencies between tables, visualize the orchestration, have a better way to handle the changes of the SPs when tables are changed etc.

Can you please advice for some frameworks or tools, preferably open-source, that you have tried for Snowflake? Will DBT be a solution to that?

Thank you Pantelis


Solution

  • dbt is a good solution to deploying your warehouse as code, but not a great solution for using your warehouse as a db for services to write intermediary tables.

    If you care about data lineage, and you're willing to rethink the SP approach, then I would recommend dbt as a tool to deploy your warehouse infrastructure as code, and easily understand the downstream dependencies of your data.

    dbt is great if you are willing to approach everything as an ELT problem, and allow dbt to be the infrastructure that transforms a subset of your mass-loaded data/events, into something that is ready to be analyzed or ingested for BI.

    Read this for more context: https://discourse.getdbt.com/t/understanding-idempotent-data-transformations/518