I'm interested in the difference between Structured Streaming and Delta Live Tables. Databricks says: "For most streaming or incremental data processing or ETL tasks, Databricks recommends Delta Live Tables."
Does that mean I should always stick to DLT, and that Structured Streaming is a legacy feature?
TL;DR - DLT = SaaS Structured Streaming; it makes streaming simple to implement, at a cost ($$).
Say you want to incrementally read JSON files landing in /path/to/json/file/streams/taxi_raw, filter them, and write the result to a Delta table at /path/to/delta/tables/filtered_data. Using plain Structured Streaming:
# Note: a streaming JSON source needs a schema up front (taxi_schema below is a placeholder),
# and every streaming write needs its own checkpoint location.
df_taxi_raw = spark.readStream.schema(taxi_schema).json('/path/to/json/file/streams/taxi_raw')
df_taxi_raw.writeStream.format('delta').option('checkpointLocation', '/path/to/checkpoints/taxi_raw').start('/path/to/delta/tables/taxi_raw')
df_filtered_data = spark.readStream.format('delta').load('/path/to/delta/tables/taxi_raw').where(...)
df_filtered_data.writeStream.format('delta').option('checkpointLocation', '/path/to/checkpoints/filtered_data').start('/path/to/delta/tables/filtered_data')
Same thing using DLT:
import dlt

@dlt.view
def taxi_raw():
    return spark.read.format("json").load("/path/to/json/file/streams/taxi_raw")

@dlt.table(name="filtered_data")
def create_filtered_data():
    return dlt.read("taxi_raw").where(...)
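Note that this DLT version reads the source as a batch view. If you want DLT to ingest incrementally as well (a sketch, assuming Auto Loader is available in your workspace; table names mirror the example above), the usual pattern is a streaming table fed by Auto Loader ("cloudFiles"), with downstream tables reading via dlt.read_stream. DLT manages schemas, checkpoints and retries for you, which is part of what you pay for:

```python
import dlt

# Incremental ingestion with Auto Loader; no explicit schema or checkpoint needed,
# DLT/Auto Loader manage both.
@dlt.table(name="taxi_raw")
def taxi_raw():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/path/to/json/file/streams/taxi_raw"))

# Incremental read of the upstream table.
@dlt.table(name="filtered_data")
def filtered_data():
    return dlt.read_stream("taxi_raw").where(...)
```

This only runs inside a Databricks DLT pipeline (the dlt module is not available locally), and the .where(...) condition is left elided as in the examples above.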
... Databricks recommends Delta Live Tables.
Does it mean I should always stick to DLT, and Structured Streaming is an old feature?
"Databricks recommends" because they're in the business of making money, not because DLT is a "new feature" replacing an old one. It's a bit like Walmart recommending "Walmart+": you don't need it to shop at Walmart.
For an actual old-vs-new example: Spark Streaming (the DStream API) was replaced by Structured Streaming, and both are features of open-source Spark. DLT, by contrast, is a proprietary Databricks feature built on top of Spark.
Structured Streaming is developed by the Apache Spark project and will keep gaining new features; Spark Streaming is deprecated.
Understand the costs and benefits, then decide. You can do streaming with either DLT or stock Spark Structured Streaming.