Tags: git, hash, version-control, dvc

`dvc add` multiple files in single `data.dvc` file


I'm trying to use DVC (Data Version Control) to track the version history of datasets in my project.

I have a directory data that contains the files I want to track. If I run dvc add data, I get a single data.dvc file that contains only one entry for the whole directory:

outs:
- md5: c11d21fda7e9a03126770a42274f1d31.dir
  size: 2169474
  nfiles: 4
  hash: md5
  path: data

If I run dvc add data/* I get many .dvc files. Each also contains a single entry, but this time for the individual file it represents.

How can I create one data.dvc that contains many entries? I know it is possible because there is a data.dvc file in a former colleague's repo that accomplishes this and works as expected:

outs:
- hash: md5
  path: data
  files:
  - relpath: file1.txt
    md5: 194577a7e20bdcc7afbb718f502c134c
    size: 6148
  - relpath: file2.txt
    md5: fd26fdc537a2ca490a6315bbb35707e7
  - relpath: file3.txt
    md5: 78032617d0d0f45f10a1cfb7759ec25c
    size: 16045
  - relpath: file4.txt
    md5: bdb4a8a529ff1063a799e37294aa0899
    size: 4291
  - relpath: file5.txt
    md5: dfa872a9b4e6b57cee63cceab0908c42

I'm sure I could create the individual dvc files and then manually collate them, but I would like to use the dvc tool.

I'm running a recent dvc version:

% dvc --version
3.59.1

Solution

  • I'm not sure how to get the dvc tool to create a single data.dvc file with multiple entries from scratch, but you can write the file by hand for a single initial file and let the dvc tool update it from there as more files are added. For example, md5 data/somefile1.txt computes the hash and stat -f%z data/somefile1.txt gives the size in bytes (that is the macOS syntax; on Linux with GNU coreutils the equivalents are md5sum data/somefile1.txt and stat -c%s data/somefile1.txt). With these two values, we can manually write the file like this:

    outs:
    - hash: md5
      path: data
      files:
      - relpath: somefile1.txt
        md5: a8f187f9aeb84be1965be784274fadc5
        size: 11320
    

    After this point, we can drop more files (somefile2.txt, somefile3.txt) into the data/ folder and run dvc add data. They are then included automatically in data.dvc as separate entries in the same file:

    outs:
    - hash: md5
      path: data
      files:
      - relpath: somefile1.txt
        md5: a8f187f9aeb84be1965be784274fadc5
        size: 11320
      - relpath: somefile2.txt
        md5: a791cfc3853813c58a4aedf5ec909f31
        size: 176
      - relpath: somefile3.txt
        md5: 70a656ef678dc642e877a1b13067fb56
        size: 258648
    

    Now we can run dvc push and dvc pull like we'd expect.
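
The manual step described above (computing an MD5 hash and a byte size, then writing them under files:) can be sketched in a platform-independent way. This is only a minimal sketch using the Python standard library. The function names file_entry and render_data_dvc are hypothetical, it assumes the hash is the plain MD5 of the file contents (as the md5 command above produces), and it assumes file names that need no YAML quoting. Whether dvc accepts a hand-generated file is based only on the behaviour reported above for 3.59.1.

```python
import hashlib
from pathlib import Path


def file_entry(path: Path, root: Path) -> dict:
    """Return relpath, md5, and size for one file under root (hypothetical helper)."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Hash in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "relpath": path.relative_to(root).as_posix(),
        "md5": digest.hexdigest(),
        "size": path.stat().st_size,
    }


def render_data_dvc(root: Path) -> str:
    """Render the 'outs' layout shown above for every file under root."""
    lines = ["outs:", "- hash: md5", f"  path: {root.name}", "  files:"]
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entry = file_entry(path, root)
        # Assumption: relpath needs no YAML quoting (no colons, leading dashes, etc.).
        lines.append(f"  - relpath: {entry['relpath']}")
        lines.append(f"    md5: {entry['md5']}")
        lines.append(f"    size: {entry['size']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumption: run from the directory that contains data/, as in the question.
    print(render_data_dvc(Path("data")), end="")
```

Redirecting the output to data.dvc would give a starting file in the same shape as the hand-written one. Running dvc add data afterwards, as described above, lets the tool take over maintaining the entries.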