I am currently working on a lineage system that will be deployed in our google-cloud space, the goal is to extract and parse SQL queries executed from BigQuery using audit logs and create a lineage out of those. I explored a couple of existing products but eventually decided to build my own.
I came across the Google's POC and documentation (https://cloud.google.com/architecture/building-a-bigquery-data-lineage-solution) for data-lineage which led me to explore the usage of zetaSQL.
I was able to parse simple queries and extract referenced tables and output columns from it using Analyzer.extractTableNamesFromStatement(query)
and Analyzer.analyzeStatement(query, analyzerOptions, simpleCatalog
. However, I was wondering if zetaSQL is capable of achieving these things below, and if so, It would be really helpful if you can give me some examples or path to take.
ResolvedNodes
?additionally, I was also exploring the usage of SimpleCatalog
and
I was wondering.
Thank you.
I'll answer my own question for bullet number 1, it seems it is depending on the type of node. for example, if you are trying to obtain lineage on the functions applied on a column, We could use ResolvedFunctionCall
to obtain transformations applied and from there you can do a visit on node ResolvedColumnRef
which enables you to get source column for that expression.