I'm interested in the difference between Structured Streaming and Delta Live Tables. Databricks says: "For most streaming or incremental data processing or ETL tasks, Databricks recommends Delta Live Tables."
Does that mean I should always stick to DLT, and that Structured Streaming is a legacy feature?
TL;DR - DLT = SaaS Structured Streaming; it makes streaming simple to implement, at a cost ($$).
Say you want to incrementally read JSON files landing in /path/to/json/file/streams/taxi_raw, filter them, and write the result to a Delta table at /path/to/delta/tables/filtered_data. Using plain Structured Streaming:
# Note: a streaming JSON source needs a schema up front (taxi_schema below is a placeholder),
# and every streaming write needs its own checkpoint location.
df_taxi_raw = spark.readStream.schema(taxi_schema).json('/path/to/json/file/streams/taxi_raw')
df_taxi_raw.writeStream.format('delta').option('checkpointLocation', '/path/to/checkpoints/taxi_raw').start('/path/to/delta/tables/taxi_raw')
df_filtered_data = spark.readStream.format('delta').load('/path/to/delta/tables/taxi_raw').where(...)
df_filtered_data.writeStream.format('delta').option('checkpointLocation', '/path/to/checkpoints/filtered_data').start('/path/to/delta/tables/filtered_data')
Same thing using DLT:
import dlt

@dlt.view
def taxi_raw():
    return spark.read.format("json").load("/path/to/json/file/streams/taxi_raw")

@dlt.table(name="filtered_data")
def create_filtered_data():
    return dlt.read("taxi_raw").where(...)
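Note that this DLT version reads the source as a batch view. If you want DLT to ingest incrementally as well (a sketch, assuming Auto Loader is available in your workspace; table names mirror the example above), the usual pattern is a streaming table fed by Auto Loader ("cloudFiles"), with downstream tables reading via dlt.read_stream. DLT manages schemas, checkpoints and retries for you, which is part of what you pay for:

```python
import dlt

# Incremental ingestion with Auto Loader; no explicit schema or checkpoint needed,
# DLT/Auto Loader manage both.
@dlt.table(name="taxi_raw")
def taxi_raw():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/path/to/json/file/streams/taxi_raw"))

# Incremental read of the upstream table.
@dlt.table(name="filtered_data")
def filtered_data():
    return dlt.read_stream("taxi_raw").where(...)
```

This only runs inside a Databricks DLT pipeline (the dlt module is not available locally), and the .where(...) condition is left elided as in the examples above.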
... Databricks recommends Delta Live Tables.
Does it mean I should always stick to DLT, and Structured Streaming is an old feature?
"Databricks recommends" because they're in the business of making money, not because DLT is a "new feature" replacing an old one. It's a bit like Walmart recommending "Walmart+": you don't need it to shop at Walmart.
For an actual old-vs-new example: Spark Streaming (the DStream API) was replaced by Structured Streaming, and both are features of open-source Spark. DLT, by contrast, is a proprietary Databricks feature built on top of Spark.
Structured Streaming is developed by the Apache Spark project and will keep gaining new features; Spark Streaming is deprecated.
Understand the costs and benefits, then decide. You can do streaming with either DLT or stock Spark Structured Streaming.