
How to use dbt for a data lakehouse when using Python or Spark for transformations


My organization has a data lakehouse (in Databricks). We primarily use Python and Spark (not SQL) to transform data in the lakehouse. (We only use SQL for maintaining schema and loading tables.) The lakehouse architect is recommending that we use dbt for our data transformations. That confuses me. To my knowledge, dbt is for SQL-based transformations, not Python- or Spark-based transformations. Is that not true?


Solution

  • dbt is a data transformation tool; as I understand it, it acts as a transformation layer on top of your underlying data platform, which can be a relational database like Postgres or MySQL, or a cloud data platform like Snowflake, Redshift, or Databricks.

    The supported adapters are listed in the documentation: dbt Adapters

    Note: Databricks is also a supported dbt adapter.

    To your question of whether it supports Spark transformations: yes, it does, per the documentation (a sample connection profile for Databricks follows the list below).

    dbt-spark can connect to Spark clusters by four different methods:

    • odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster.

    • thrift connects directly to the lead node of a cluster, either locally hosted / on premise or in the cloud (e.g. Amazon EMR).

    • http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.

    • session connects to a pySpark session, running locally or on a remote machine.
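    As a concrete illustration, here is a minimal sketch of what a profiles.yml entry for the odbc method might look like; the profile name, host, endpoint, token, and schema values are placeholders, not taken from the question or the docs:

        # profiles.yml -- minimal sketch for dbt-spark's odbc method (all values are placeholders)
        my_lakehouse:
          target: dev
          outputs:
            dev:
              type: spark
              method: odbc
              driver: /path/to/simba/spark/odbc/driver
              host: adb-1234567890123456.7.azuredatabricks.net
              endpoint: <sql-endpoint-id>        # or `cluster: <cluster-id>` for an interactive cluster
              token: <personal-access-token>
              schema: analytics
              port: 443

    If you use the newer dbt-databricks adapter instead of dbt-spark, the profile looks similar but uses type: databricks together with an http_path for the SQL warehouse or cluster.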

    dbt also supports Python models, per the documentation (a minimal example follows the quoted description).

    A dbt Python model is a function that reads in dbt sources or other models, applies a series of transformations, and returns a transformed dataset. DataFrame operations define the starting points, the end state, and each step along the way.
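    To make that concrete, on Databricks a dbt Python model is just a file in your models/ directory that defines a model(dbt, session) function and returns a Spark DataFrame. A minimal sketch, with model and column names that are illustrative rather than from the question:

        # models/orders_enriched.py -- minimal sketch of a dbt Python model (illustrative names)
        from pyspark.sql import functions as F

        def model(dbt, session):
            # Python models are materialized as tables (or incremental), not views
            dbt.config(materialized="table")

            # dbt.ref() returns the upstream model as a Spark DataFrame on Databricks
            orders = dbt.ref("stg_orders")

            # Ordinary PySpark transformations
            enriched = (
                orders
                .filter(F.col("status") == "completed")
                .withColumn("order_month", F.date_trunc("month", F.col("order_date")))
            )

            # The returned DataFrame is written back to the lakehouse as the model's table
            return enriched

    So the transformation logic itself stays in PySpark; dbt adds the dependency graph (ref), testing, and documentation around it.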