I'm trying to use DVC (Data Version Control) to track the version history of datasets in my project.
I have a directory data which contains the files I want to track. If I run dvc add data, I get a single data.dvc file that contains only one entry for the whole directory:
outs:
- md5: c11d21fda7e9a03126770a42274f1d31.dir
  size: 2169474
  nfiles: 4
  hash: md5
  path: data
If I run dvc add data/* instead, I get many .dvc files. Each also contains a single entry, but this time for the individual file it represents.
How can I create one data.dvc that contains many entries? I know it is possible because there is a data.dvc file in a former colleague's repo that accomplishes this and works as expected:
outs:
- hash: md5
  path: data
  files:
  - relpath: file1.txt
    md5: 194577a7e20bdcc7afbb718f502c134c
    size: 6148
  - relpath: file2.txt
    md5: fd26fdc537a2ca490a6315bbb35707e7
  - relpath: file3.txt
    md5: 78032617d0d0f45f10a1cfb7759ec25c
    size: 16045
  - relpath: file4.txt
    md5: bdb4a8a529ff1063a799e37294aa0899
    size: 4291
  - relpath: file5.txt
    md5: dfa872a9b4e6b57cee63cceab0908c42
I'm sure I could create the individual .dvc files and then collate them by hand, but I would prefer to do this with the dvc tool itself.
I'm running a recent dvc version:
% dvc --version
3.59.1
I'm not sure how to get the dvc tool to produce a single data.dvc file with multiple entries from scratch, but the file can be written by hand in the desired format for a single initial file, and the dvc tool can then update it as more files are added. For example, we can use md5 data/somefile1.txt to compute the hash and stat -f%z data/somefile1.txt to get the size in bytes (at least on macOS; on Linux the equivalents are md5sum data/somefile1.txt and stat -c%s data/somefile1.txt). With these two values, we can manually write the file to look like this:
outs:
- hash: md5
  path: data
  files:
  - relpath: somefile1.txt
    md5: a8f187f9aeb84be1965be784274fadc5
    size: 11320
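The manual steps above (hash each file, read its size, write the entries) can also be scripted. The following is only a rough sketch, not something dvc provides: the helper names file_entry and render_dvc are made up for illustration, and it assumes that the md5 value in the file is the plain MD5 of the file's bytes, which is what the md5 command above computes. It prints text in the same layout as the hand-written file, so the output should be checked before saving it as data.dvc.

```python
import hashlib
from pathlib import Path


def file_entry(path: Path, root: Path) -> dict:
    """Return relpath, MD5 hex digest, and size in bytes for one file."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large data files are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "relpath": path.relative_to(root).as_posix(),
        "md5": digest.hexdigest(),
        "size": path.stat().st_size,
    }


def render_dvc(root: Path) -> str:
    """Render data.dvc-style text listing every file under root.

    Hypothetical helper: the layout mirrors the hand-written example,
    it is not generated by dvc itself.
    """
    lines = ["outs:", "- hash: md5", f"  path: {root.name}", "  files:"]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entry = file_entry(path, root)
        lines.append(f"  - relpath: {entry['relpath']}")
        lines.append(f"    md5: {entry['md5']}")
        lines.append(f"    size: {entry['size']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    root = Path("data")
    if root.is_dir():
        # Print only; redirect to data.dvc after reviewing the output.
        print(render_dvc(root), end="")
```

Run from the project root, this walks data/ recursively, so any hidden or helper files inside that folder would be listed too and may need to be removed from the output by hand.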
After this point, we can drop more files (somefile2.txt, somefile3.txt) into the data/ folder and run dvc add data to have them automatically included in our data.dvc as separate entries in the same file:
outs:
- hash: md5
  path: data
  files:
  - relpath: somefile1.txt
    md5: a8f187f9aeb84be1965be784274fadc5
    size: 11320
  - relpath: somefile2.txt
    md5: a791cfc3853813c58a4aedf5ec909f31
    size: 176
  - relpath: somefile3.txt
    md5: 70a656ef678dc642e877a1b13067fb56
    size: 258648
Now we can run dvc push and dvc pull like we'd expect.