I am designing a data pipeline that consumes data from Salesforce using the Bulk API endpoint (pull mechanism).
The data lands in an ADLS Gen2 Bronze layer.
Next, a transformation job starts, cleans the data, and pushes it to the ADLS Gen2 Silver layer. The transformation is performed by Databricks.
Once the clean records are in the ADLS Gen2 Silver layer, I then use Databricks to push them to another Databricks environment.
My questions are:
How do I handle orchestration?
I have to pull the full data set once, then incremental records every hour, where a record is loaded only if it is not already present.
Then, how do I make sure the transformation starts only once all the records have arrived? The records are processed using Databricks.
How do I make sure the next step after processing is to push the records to the ADLS Gen2 Silver layer?
And lastly, how does Databricks know it has to move those records to Databricks instance B, as shown in the figure?
Could someone please suggest how to achieve this?
Which option is scalable, reliable, and able to handle high throughput?
Image: Logical Flow
Thanks a lot.
For your scenario, the best approach is Option 1: use an Azure Function for ingestion, orchestrated end-to-end with Azure Data Factory (ADF), then transform Bronze to Silver using Databricks.
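To make the ingestion piece concrete, here is a minimal sketch of an HTTP-triggered Azure Function that ADF can call on its hourly schedule: it pulls records through the Salesforce Bulk API and lands them in the Bronze container as newline-delimited JSON. The libraries (simple_salesforce, azure-storage-file-datalake) and all names (the Account object, the LastModifiedDate watermark, environment variables, container and folder paths) are assumptions to adapt to your setup, not a fixed implementation.

```python
# function_app.py -- HTTP-triggered function that ADF calls on its hourly schedule.
# All names (env vars, container, Salesforce object, watermark column) are illustrative.
import json
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.storage.filedatalake import DataLakeServiceClient
from simple_salesforce import Salesforce

app = func.FunctionApp()


@app.route(route="ingest", auth_level=func.AuthLevel.FUNCTION)
def ingest(req: func.HttpRequest) -> func.HttpResponse:
    # ADF passes the watermark of the previous run; an empty value means full load.
    since = req.params.get("since", "")

    sf = Salesforce(
        username=os.environ["SF_USERNAME"],
        password=os.environ["SF_PASSWORD"],
        security_token=os.environ["SF_TOKEN"],
    )

    soql = "SELECT Id, Name, LastModifiedDate FROM Account"
    if since:
        soql += f" WHERE LastModifiedDate > {since}"  # incremental pull

    # Bulk API query; returns the matching records as a list of dicts.
    records = sf.bulk.Account.query(soql)

    # Land the raw batch in the Bronze container, one file per run,
    # as newline-delimited JSON so Spark can read it directly later.
    dls = DataLakeServiceClient(
        account_url=os.environ["ADLS_URL"],
        credential=os.environ["ADLS_KEY"],
    )
    run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    file_path = f"salesforce/account/{run_ts}.json"
    file_client = dls.get_file_system_client("bronze").get_file_client(file_path)
    file_client.upload_data("\n".join(json.dumps(r) for r in records), overwrite=True)

    # Return the landed file and record count so ADF can log and chain on it.
    return func.HttpResponse(
        json.dumps({"file": file_path, "count": len(records)}),
        mimetype="application/json",
    )
```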
ADF handles the full orchestration: triggering the Azure Function, checking for file arrival in the ADLS Bronze layer, and kicking off Databricks jobs only when all the data is ready.
Inside Databricks, use a simple control table to log which files have been processed; this ensures transformations only run on complete data.
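A minimal sketch of such a control table on Databricks, implemented as a Delta table keyed by file path, could look like the following; the table name, columns, and helper functions are assumptions for illustration.

```python
# Minimal control-table pattern on Databricks: log each Bronze file once,
# and let the transformation pick up only files that have not been processed yet.
# Table, column, and function names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

CONTROL_TABLE = "bronze_file_log"

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CONTROL_TABLE} (
        file_path STRING,
        ingested_at TIMESTAMP,
        processed BOOLEAN
    ) USING DELTA
""")


def register_file(file_path: str) -> None:
    """Called after the Azure Function (or ADF) lands a new Bronze file."""
    spark.sql(f"""
        MERGE INTO {CONTROL_TABLE} t
        USING (SELECT '{file_path}' AS file_path) s
        ON t.file_path = s.file_path
        WHEN NOT MATCHED THEN
            INSERT (file_path, ingested_at, processed)
            VALUES (s.file_path, current_timestamp(), false)
    """)


def unprocessed_files() -> list[str]:
    """Files the Bronze-to-Silver job still needs to pick up."""
    rows = spark.table(CONTROL_TABLE).filter(~F.col("processed")).collect()
    return [r.file_path for r in rows]


def mark_processed(file_paths: list[str]) -> None:
    """Flip the flag once the transformation has committed to Silver."""
    if not file_paths:
        return
    quoted = ", ".join(f"'{p}'" for p in file_paths)
    spark.sql(f"""
        UPDATE {CONTROL_TABLE}
        SET processed = true
        WHERE file_path IN ({quoted})
    """)
```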
When the transformation finishes, let ADF move the cleaned data forward or trigger the next Databricks job; this chaining is reliable, scalable, and easy to monitor.
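For the Bronze-to-Silver job that ADF kicks off once the arrival check passes, a hedged sketch of the deduplicating upsert might look like this; the paths, table names, and the use of Id as the business key are all assumptions.

```python
# Sketch of the Bronze-to-Silver job that ADF triggers once all files have arrived.
# It deduplicates on the Salesforce Id so hourly incremental records are upserted
# rather than duplicated. Paths, table names, and columns are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

BRONZE_PATH = "abfss://bronze@<storage-account>.dfs.core.windows.net/salesforce/account/"
SILVER_TABLE = "silver_account"

# Read the newly arrived Bronze files (e.g. the list returned by unprocessed_files()).
raw = spark.read.json(BRONZE_PATH)

# Basic cleaning: drop Salesforce metadata and keep only the latest version of each Id.
latest_first = Window.partitionBy("Id").orderBy(F.col("LastModifiedDate").desc())
clean = (
    raw.drop("attributes")
    .withColumn("LastModifiedDate", F.to_timestamp("LastModifiedDate"))
    .withColumn("_rn", F.row_number().over(latest_first))
    .filter("_rn = 1")
    .drop("_rn")
)

if spark.catalog.tableExists(SILVER_TABLE):
    # Upsert: insert new Ids, refresh existing ones only when the source row is newer.
    (
        DeltaTable.forName(spark, SILVER_TABLE)
        .alias("t")
        .merge(clean.alias("s"), "t.Id = s.Id")
        .whenMatchedUpdateAll(condition="s.LastModifiedDate > t.LastModifiedDate")
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    # First (full-load) run: create the Silver table.
    clean.write.format("delta").saveAsTable(SILVER_TABLE)
```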
This pattern keeps ingestion, transformation, and orchestration loosely coupled but fully automated, which is ideal for high-throughput pipelines.