I'd like to have these remotes (from .dvc/config):
['remote "test-data"']
url = gs://some-test-bucket/dvc
['remote "prod-data"']
url = gs://some-prod-bucket/dvc
(I have not set a default remote.)
And I have some test data in the folder ./test-data, and production data in the folder ./prod-data.
The .dvc file for prod data:
$ cat prod-data.dvc
outs:
- md5: 057682599b100f0240ca51b6256ed7d5.dir
  size: 135840994497
  nfiles: 17008
  hash: md5
  path: prod-data
Example of .dvc file for test data:
$ cat test-data/some_folder.dvc
outs:
- md5: c06520abe0140c72004dbe4494a78b23.dir
  size: 692847854
  nfiles: 8
  hash: md5
  path: some_folder
I want the command dvc pull -r prod-data to only give me the ./prod-data/ folder, but instead it's fetching more:
$ dvc pull -r prod-data
A prod-data/
A test-data/some_folder/
A test-data/some_other_folder_entirely/
3 files added
How can I set this up so that the test files are stored in one remote, while the prod data is stored in another? Maybe I'm misunderstanding how DVC should be used?
Thanks!
In this setup, dvc pull -r prod-data and dvc push -r prod-data try to pull / push all data (all .dvc files), unless you explicitly specify a target: dvc pull -r test-data test-data/some_folder.
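If you stay with explicit targets for now, one way to avoid retyping them is to keep a small mapping from remote to targets and generate the commands from it. This is only a sketch: the mapping and the helper name pull_commands are made up for illustration, and the targets are the folders from the question.

```python
import shlex

# Hypothetical mapping: remote name -> targets stored in that remote.
# The paths are taken from the question; adjust them to your repo.
TARGETS_BY_REMOTE = {
    "prod-data": ["prod-data"],
    "test-data": [
        "test-data/some_folder",
        "test-data/some_other_folder_entirely",
    ],
}


def pull_commands(targets_by_remote):
    """Build one `dvc pull -r <remote> <targets...>` command per remote."""
    return [
        ["dvc", "pull", "-r", remote, *targets]
        for remote, targets in targets_by_remote.items()
    ]


# Print the commands instead of running them, so nothing is fetched here.
for cmd in pull_commands(TARGETS_BY_REMOTE):
    print(shlex.join(cmd))
```

Each command list can be passed to subprocess.run(cmd, check=True) if you want the script to run the pulls itself.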
To actually split data by a few remotes, you need to use the remote field:
outs:
- md5: c06520abe0140c72004dbe4494a78b23.dir
  size: 692847854
  nfiles: 8
  hash: md5
  path: some_folder
  remote: test-data
At the moment, I don't think it can be specified as part of the dvc add command that creates the .dvc files for you. It's expected that you manage this (and some other fields) manually.
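If there are many .dvc files to edit, the manual step could be scripted. The following is a minimal sketch, not a DVC feature: the names add_remote and tag_dvc_files are hypothetical, and it assumes the .dvc files have the layout shown above (a top-level outs: list with unindented "- " entries and two-space continuation lines). It edits the text directly so that it needs nothing beyond the standard library.

```python
from pathlib import Path


def add_remote(dvc_text: str, remote: str) -> str:
    """Append `remote: <remote>` to every entry under `outs:` that lacks one.

    Assumes the layout written by `dvc add`: entries start with "- " at
    column 0 and their remaining keys are indented by two spaces.
    """
    out_lines = []
    in_outs = False      # inside the top-level `outs:` list
    entry_open = False   # inside one "- ..." entry of that list
    has_remote = False   # the current entry already has a `remote:` key

    def close_entry():
        nonlocal entry_open, has_remote
        if entry_open and not has_remote:
            out_lines.append(f"  remote: {remote}")
        entry_open = False
        has_remote = False

    for line in dvc_text.splitlines():
        if in_outs and line.startswith("- "):
            close_entry()  # finish the previous entry, if any
            entry_open = True
            has_remote = line[2:].startswith("remote:")
        elif entry_open and line.startswith("  "):
            if line.strip().startswith("remote:"):
                has_remote = True
        elif line.strip():
            # Any other non-blank line is a new top-level key.
            close_entry()
            in_outs = line.rstrip() == "outs:"
        out_lines.append(line)
    close_entry()
    return "\n".join(out_lines) + "\n"


def tag_dvc_files(folder: str, remote: str) -> None:
    """Rewrite every *.dvc file directly inside `folder` (hypothetical helper)."""
    for dvc_file in Path(folder).glob("*.dvc"):
        dvc_file.write_text(add_remote(dvc_file.read_text(), remote))
```

For the layout in the question, tag_dvc_files("test-data", "test-data") would tag the test outputs. Running it twice is harmless, since entries that already have a remote key are left alone. Review the resulting diff before committing.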
Once that's done, you also won't need to keep specifying -r on dvc pull / dvc push. DVC will pick the remote automatically.
Please give it a try and let me know if you hit any issues.