I'm currently setting up Delta Live Tables in Azure Databricks to support real-time use cases. For a given data source, let's assume I have 10 tables. I've set up a single Delta Live Tables pipeline for these tables, scheduled to run every 3 hours. However, this approach is proving quite costly, and I'm looking for guidance on best practices for optimizing my use of Delta Live Tables.
Here are a few additional details to consider:
Data Format: CSV
Cluster size while running the DLT pipeline: fixed, with 4 workers and 1 driver
Full Load Data Volume: exceeds 250 million records
Incremental Load Data Volume: over 10 million records
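For reference, each of the 10 tables is defined along the lines of the minimal sketch below. This assumes the CSV files land in cloud storage and are picked up incrementally with Auto Loader; the table name and storage path are placeholders, not my actual ones.

```python
# Minimal sketch of one table definition, assuming incremental CSV ingest
# with Auto Loader. Table name and path are placeholders.
import dlt

@dlt.table(
    name="source_table_1",  # placeholder; repeated for each of the 10 tables
    comment="Incremental CSV ingest via Auto Loader",
)
def source_table_1():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader source
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("abfss://raw@<storage-account>.dfs.core.windows.net/source_table_1/")  # placeholder path
    )
```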
Please suggest a few best practices I should follow to reduce the cost.
Absolutely, the points above are on the mark. Below are a few more practices from my own experience that helped me reduce costs.
Optimizing Clusters:
Cost Efficiency and Usage:
Mode Variations and Cost Control: Be aware of the differences between Development and Production modes. In Development mode the cluster keeps running for roughly 2 extra hours after the update completes so it can be reused, which adds cost. In Production mode the cluster terminates as soon as the update finishes.
Adjust "Development mode" settings to manage costs by changing cluster shutdown delays in Pipeline settings. set pipelines.clusterShutdown.delay to 60s