azure-databricks azure-batch-account

Load Data Using Azure Batch Service and Spark Databricks


I have a file in Azure Blob Storage that I need to load into the Data Lake daily. I am not clear on which approach I should use (Azure Batch Account with a Custom Activity, Databricks, or a Copy activity). Please advise.


Solution

  • To load files from Blob Storage into the Data Lake, you can use a Data Factory pipeline. Since the requirement is to run the copy every day, you need to schedule a trigger.

    A schedule trigger runs the pipeline periodically at the interval you select. Each run copies the file or directory again and overwrites the previous copy in the destination, so any changes made to the file in Blob Storage on a given day are reflected in the Data Lake after the next scheduled run of the Copy activity.

    You can also use a Databricks notebook activity in the pipeline to do the same thing: the notebook contains the copy logic and is run every time the pipeline is triggered, as in the sketch below.
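    As a rough illustration, the notebook's copy logic can be as simple as the following PySpark cell. This is only a sketch: the storage account, container, and path names are placeholders, and access to both storage accounts is assumed to be configured already (mounts, account keys, or a service principal in the cluster's Spark config).

    ```python
    # Minimal sketch of the copy logic in a Databricks notebook.
    # `spark` and `dbutils` are provided by the notebook environment.
    # All account, container, and path names are placeholders.

    source_path = "wasbs://input@sourcestorageacct.blob.core.windows.net/daily/data.csv"
    sink_path = "abfss://raw@datalakeacct.dfs.core.windows.net/daily/data.csv"

    # Straight file copy; set recurse=True to copy a whole directory.
    dbutils.fs.cp(source_path, sink_path, recurse=False)

    # Alternatively, read with Spark and overwrite the previous copy in the Data Lake,
    # so each scheduled run replaces the prior day's version.
    df = spark.read.option("header", "true").csv(source_path)
    df.write.mode("overwrite").option("header", "true").csv(sink_path)
    ```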

    You can follow these steps to perform the copy:

    [Screenshot: Data Factory Copy activity setup steps]
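    If you prefer to define the pipeline in code rather than through the portal wizard, a minimal sketch with the azure-mgmt-datafactory Python SDK could look like the following. The subscription, resource group, factory, dataset, and pipeline names are placeholders, and the source and sink datasets are assumed to already exist in the factory.

    ```python
    # Sketch only: assumes the data factory and the source/sink datasets already exist.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureBlobFSSink,
        BlobSource,
        CopyActivity,
        DatasetReference,
        PipelineResource,
    )

    # Placeholder identifiers -- replace with your own values.
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group>"
    factory_name = "<data-factory-name>"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Copy activity: Blob Storage source -> Data Lake (ADLS Gen2) sink.
    copy_activity = CopyActivity(
        name="CopyBlobToDataLake",
        inputs=[DatasetReference(reference_name="SourceBlobDataset")],
        outputs=[DatasetReference(reference_name="SinkDataLakeDataset")],
        source=BlobSource(),
        sink=AzureBlobFSSink(),
    )

    pipeline = PipelineResource(activities=[copy_activity])
    adf_client.pipelines.create_or_update(
        resource_group, factory_name, "CopyBlobToDataLakePipeline", pipeline
    )
    ```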

    The key point is that, whichever method you use, you must attach a schedule trigger so that the pipeline recurs at the required interval (every 24 hours in your case).
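    Continuing the same sketch (same client, resource group, and factory names as above), a daily schedule trigger can be attached to the pipeline like this; the start time and trigger name are placeholders, and a newly created trigger still has to be started before it fires.

    ```python
    # Sketch only: a schedule trigger that runs the pipeline once every 24 hours.
    from datetime import datetime, timezone

    from azure.mgmt.datafactory.models import (
        PipelineReference,
        ScheduleTrigger,
        ScheduleTriggerRecurrence,
        TriggerPipelineReference,
        TriggerResource,
    )

    recurrence = ScheduleTriggerRecurrence(
        frequency="Day",   # run daily ...
        interval=1,        # ... every 1 day
        start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),  # placeholder start time
        time_zone="UTC",
    )

    trigger = ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    reference_name="CopyBlobToDataLakePipeline"
                )
            )
        ],
    )

    adf_client.triggers.create_or_update(
        resource_group, factory_name, "DailyCopyTrigger", TriggerResource(properties=trigger)
    )
    # Triggers are created in a stopped state; start the trigger so the schedule takes
    # effect (triggers.start or triggers.begin_start, depending on the SDK version).
    ```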

    You can refer to the following docs: