azure-blob-storage dvc

Downloading data from azure storage explorer using dvc


I have an Azure blob container with data that I have not uploaded myself. The data is not stored locally on my computer. Is it possible to use dvc to download the data to my computer when I haven’t uploaded the data with dvc? Is it possible with dvc import-url? I have tried using dvc pull, but I can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push. And if I do it that way, the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format? If not, is there another way to download data automatically from Azure?


Solution

  • I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".

    Assumptions

    Let's assume the following:

    High-level idea

    One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.

    This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS. As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.

    The high-level idea is:

    1. Add an initial update_dataset stage to the DVC pipeline that checks whether changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals). If changes are detected, the update_dataset stage uses the azcopy sync [src] [dst] command to apply the changes from the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
    2. Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
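
As a rough illustration of these two steps on Linux or macOS, here is a minimal POSIX shell sketch. The helper name run_then_mark is hypothetical and not part of DVC or azcopy; it runs whatever sync command it is given and overwrites the 'dummy' file with a UTC timestamp only if that command succeeds. The azcopy call is left commented out because it needs a real account, container and SAS token.

```shell
#!/bin/sh
# run_then_mark MARKER CMD [ARGS...]
# Hypothetical helper: run the sync command, then refresh the marker file.
run_then_mark() {
    marker=$1
    shift
    # Run the given command; only on success overwrite the marker with a
    # UTC timestamp, so its content changes between update events.
    "$@" && date -u +"%Y-%m-%dT%H:%M:%SZ" > "$marker"
}

# Intended use (commented out: requires azcopy and valid credentials):
# run_then_mark .dataset_updated azcopy sync \
#     "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" \
#     "dataset/" --delete-destination=true
```

If the sync command fails, the marker is left untouched and the function returns a non-zero status, so a stage built this way would also fail rather than report a fresh dataset.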

    Implementation

    This procedure has been tested on Windows 10.

    1. Add a simple update_dataset stage to the DVC pipeline by running:
    $ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
    

    Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.

    2. Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modification, the command should (a) create the .dataset_updated file once azcopy has finished (touch .dataset_updated) and (b) write the current date and time into that file (echo %date%-%time% > .dataset_updated), so that its content differs between update events.
    stages:
      update_dataset:
        cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
        deps:
        - remote://myazure/dataset/
        outs:
        - .dataset_updated
    ...
    

    I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go. This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.

    3. To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
    stages:
      prepare:
        cmd: <some command>
        deps:
        - .dataset_updated # add new dependency here
        - ... # all other dependencies
    ...
    
    4. Finally, you can test different scenarios on the remote side (e.g., adding, modifying or deleting files) and check what happens when you run the DVC pipeline up to the prepare stage:
    $ dvc repro prepare
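
To see why the marker file carries a timestamp rather than being merely touched, recall that DVC detects changes in a dependency by comparing content hashes. The sketch below is only an illustration of that reasoning: it uses cksum as a simple stand-in for DVC's own hashing, and the file names and timestamps are made-up examples.

```shell
#!/bin/sh
# checksum_of FILE: print a content checksum (cksum stands in for DVC's hashing).
checksum_of() {
    cksum < "$1" | cut -d' ' -f1
}

tmp=$(mktemp -d)

# Two update events that only create an empty marker leave identical content.
: > "$tmp/touch_run1"
: > "$tmp/touch_run2"

# Two update events that write a timestamp leave different content.
# The timestamps are hard-coded examples to keep the demo deterministic.
echo "2024-01-01T10:00:00Z" > "$tmp/stamp_run1"
echo "2024-01-02T10:00:00Z" > "$tmp/stamp_run2"

if [ "$(checksum_of "$tmp/touch_run1")" = "$(checksum_of "$tmp/touch_run2")" ]; then
    echo "empty markers: same checksum, the dependent stage would look unchanged"
fi
if [ "$(checksum_of "$tmp/stamp_run1")" != "$(checksum_of "$tmp/stamp_run2")" ]; then
    echo "timestamped markers: different checksum, the dependent stage would re-run"
fi

rm -rf "$tmp"
```

With an empty marker, the prepare stage would see the same dependency content after every update and could be skipped, which is presumably what the timestamp is meant to prevent.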
    

    Notes