filesystems, dvc

What are the file name rules in DVC and can they be controlled via config?


Use case: id-10T proofing data removal with a zero-trust command.

I am looking through the documentation and I don't see clear-cut guidelines for which file names DVC accepts.

Right now, I know that DVC does some file name filtering. For example, I cannot add a file whose name contains a newline:

$: touch 'foo
bar.txt'
$: dvc add foo$'\n'bar.txt
Adding...
ERROR: output 'foobar.txt' does not exist

Can someone point me to the documentation that explains exactly what is allowed as a path in the YAML file?


Solution

  • There is no documentation on allowed file names in DVC. The behavior you are seeing comes from path normalization: DVC currently uses urllib.parse.urlsplit and urllib.parse.urlunsplit when normalizing path names, and urlsplit strips the newline because it is not a valid character in an RFC-compliant URL. DVC needs to support both local paths and remote URL paths like s3://bucket/object/path, so currently it treats everything as a URL.

    The intended behavior is that DVC should support any character that is valid for your local filesystem, so this is a bug: DVC should account for characters that are invalid in URLs but valid on local filesystems. I've opened a report which you can follow for further updates: https://github.com/iterative/dvc-objects/issues/177
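To see the effect in isolation, here is a minimal sketch using only the Python standard library. It is not DVC's actual code: the roundtrip helper is a hypothetical stand-in for a normalizer that splits and reassembles every path as a URL. It assumes Python 3.9.5 or later, where urlsplit strips ASCII newline and tab characters.

```python
from urllib.parse import urlsplit, urlunsplit


def roundtrip(path: str) -> str:
    """Hypothetical helper: split a path as a URL, then reassemble it."""
    return urlunsplit(urlsplit(path))


# A local file name containing a newline, as in the question.
local_name = "foo\nbar.txt"
normalized_local = roundtrip(local_name)

# A remote URL path of the kind DVC also has to handle.
remote_url = "s3://bucket/object/path"
normalized_remote = roundtrip(remote_url)

print(repr(local_name), "->", repr(normalized_local))
print(repr(remote_url), "->", repr(normalized_remote))
```

The remote URL comes back unchanged, while the local name loses its newline and becomes foobar.txt, which matches the name in the error message above. A file with that shorter name does not exist on disk, hence the "does not exist" error.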