I have an Azure blob container with data which I have not uploaded myself. The data is not stored locally on my computer. Is it possible to use dvc to download the data to my computer when I haven't uploaded the data with dvc? Is it possible with dvc import-url? I have tried using dvc pull, but I can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push. And if I do it that way, then the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format? If it is not possible, is there another way to download data automatically from Azure?
I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Let's assume the following:

- Your DVC pipeline is defined in a dvc.yaml file. The first stage in the current pipeline is called prepare.
- The data on the Azure blob container is stored in a folder called dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
- The Azure blob container has been configured as a DVC data remote called myazure (more info about DVC 'data remotes' here).

One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS. As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
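For example, to confirm that azcopy is reachable from any directory, you can print its version:

$ azcopy --version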
The high-level idea is:

- Add an update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
- If changes are detected, the update_dataset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
- Make a connection between the update_dataset stage and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.

This procedure has been tested on Windows 10.
First, add the update_dataset stage to the DVC pipeline by running:

$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"

Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Next, edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) pass the current date and time to the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
    - remote://myazure/dataset/
    outs:
    - .dataset_updated
  ...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go. This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
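Note that echo %date%-%time% is Windows (cmd.exe) syntax, in line with the Windows 10 setup mentioned above. On Linux or macOS, a sketch of an equivalent command (which I haven't tested on this exact setup) could use the shell's date utility, which creates the dummy file and writes a unique timestamp in one go:

cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && date > .dataset_updated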
Then, to make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:

stages:
  prepare:
    cmd: <some command>
    deps:
    - .dataset_updated # add new dependency here
    - ... # all other dependencies
  ...
Finally, reproduce the prepare stage:

$ dvc repro prepare
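If you only want to see whether DVC considers the remote dataset/ dependency (or anything else in the pipeline) changed, without actually reproducing anything, you can first run:

$ dvc status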
The solution presented above is very similar to the example given in DVC's external dependencies documentation. Instead of the az copy command, it uses azcopy sync. The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with a SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables.
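For instance, AzCopy can authenticate with Azure AD instead of a SAS token. As a sketch (I haven't verified this against the setup above), you could log in interactively:

$ azcopy login

or non-interactively with a service principal, reading the secret from an environment variable (placeholders in angle brackets; on Windows use set instead of export):

$ export AZCOPY_SPA_CLIENT_SECRET="<client-secret>"
$ azcopy login --service-principal --application-id <application-id> --tenant-id <tenant-id>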
When defining the DVC pipeline stage, I've intentionally left out an output dependency on the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as outputs when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This enables synchronization of deletions, i.e. files are deleted from your local dataset folder if they have been deleted on the Azure container.
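Because --delete-destination="true" can remove local files, it may be worth previewing what a sync would do before wiring it into the pipeline. Newer AzCopy releases offer a dry-run flag for this (check that your azcopy version supports it):

$ azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" --dry-run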