My organization has a data lakehouse (in Databricks). We primarily use Python and Spark (not SQL) to transform data in the lakehouse. (We only use SQL for maintaining schema and loading tables.) The lakehouse architect is recommending that we use dbt for our data transformations. That confuses me. To my knowledge, dbt is for SQL-based transformations, not Python- or Spark-based transformations. Is that not true?
dbt is a data transformation tool. As I understand it, it acts as a transformation layer on top of your underlying data platform, which can be a relational database like Postgres or MySQL, or a cloud data platform like Snowflake, Redshift, or Databricks.
For the full list of supported adapters, see the dbt Adapters page in the dbt documentation.
Note: Databricks is one of the supported adapters.
As for your question about whether it supports Spark transformations: yes, it does. Per the dbt-spark documentation:
dbt-spark can connect to Spark clusters by four different methods:

- odbc is the preferred method when connecting to Databricks. It supports connecting to a SQL Endpoint or an all-purpose interactive cluster (a hypothetical profile sketch follows this list).
- thrift connects directly to the lead node of a cluster, either locally hosted / on-premises or in the cloud (e.g. Amazon EMR).
- http is a more generic method for connecting to a managed service that provides an HTTP endpoint. Currently, this includes connections to a Databricks interactive cluster.
- session connects to a PySpark session, running locally or on a remote machine.
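To make the odbc option concrete, here is a minimal, hypothetical profiles.yml sketch for pointing dbt-spark at a Databricks SQL endpoint. Every name, path, and ID below is a placeholder; the field names follow the dbt-spark profile documentation.

```yaml
my_databricks_profile:    # placeholder profile name
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: /path/to/simba/spark/driver.so   # Simba Spark ODBC driver (placeholder path)
      host: dbc-xxxxxxxx.cloud.databricks.com  # placeholder workspace host
      port: 443
      schema: analytics                        # placeholder target schema
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      endpoint: my-sql-endpoint-id  # placeholder; use `cluster: <cluster-id>` for an interactive cluster instead
```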
dbt also supports Python models. Per the documentation:
A dbt Python model is a function that reads in dbt sources or other models, applies a series of transformations, and returns a transformed dataset. DataFrame operations define the starting points, the end state, and each step along the way.
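To see what that looks like in practice, here is a minimal sketch of a dbt Python model for a Spark/Databricks target. The upstream model name ("raw_orders") and the column names are hypothetical; the model(dbt, session) signature and dbt.ref() are from the dbt Python model documentation.

```python
import pyspark.sql.functions as F

def model(dbt, session):
    # Optional: configure how dbt materializes this model.
    dbt.config(materialized="table")

    # dbt.ref() declares a dependency on an upstream model; on the
    # Spark/Databricks adapters it returns a PySpark DataFrame.
    # "raw_orders" is a hypothetical upstream model name.
    orders = dbt.ref("raw_orders")

    # Ordinary PySpark transformations, just as in a hand-rolled job.
    return (
        orders
        .where(F.col("status") == "completed")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )
```

Because dbt.ref() hands you a PySpark DataFrame on these adapters, your existing PySpark transformation logic carries over largely unchanged; dbt adds dependency ordering between models and materializes the returned DataFrame as a table.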