I am a little confused about how DVC keeps track of changes within datasets. If I execute "dvc add ./data/a.csv", DVC adds ./data/a.csv to ./data/.gitignore and creates a ./data/a.csv.dvc file. On the other hand, if I have something like this in a dvc.yaml file:
    stages:
      gen-ref-arts:
        cmd: make gen-ref-arts
        outs:
          - ./data/b.csv
then when I execute "dvc repro", DVC adds ./data/b.csv to ./data/.gitignore; however, it does not create a b.csv.dvc file.
In the documentation of DVC (https://dvc.org/doc/start/data-pipelines/data-pipelines) I can read: "DVC uses the pipeline definition to automatically track the data used and produced by any stage, so there's no need to manually run dvc add for data/prepared!"
Why does it not generate a ./data/b.csv.dvc file? Is this normal? If so, why?
Pipeline outputs are tracked by the dvc.lock file. It has a similar structure to .dvc files, but it combines information across multiple stages in a single file. This was done to keep things simple for complex pipelines.
For more details, see the DVC documentation on the dvc.lock file.
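As a rough illustration, after "dvc repro" the dvc.lock file next to dvc.yaml would contain an entry along these lines for the stage from the question. This is only a sketch: the hash and size values are placeholders, the path is shown as DVC would typically normalize it, and the exact set of fields depends on the DVC version (newer releases, for example, add a `hash: md5` field to each output).

```yaml
# dvc.lock (illustrative sketch, generated by DVC; do not edit by hand)
schema: '2.0'
stages:
  gen-ref-arts:              # same stage name as in dvc.yaml
    cmd: make gen-ref-arts
    outs:
    - path: data/b.csv       # output declared under "outs" in dvc.yaml
      md5: <md5-of-b.csv>    # placeholder: content hash recorded by DVC
      size: <size-in-bytes>  # placeholder
```

The `outs` entry holds the same kind of information (path, hash, size) that a.csv.dvc holds for a file added with "dvc add", which is why no separate b.csv.dvc is needed. Commit dvc.lock to Git alongside dvc.yaml so the recorded output versions are tracked.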