I am considering learning to use DVC (https://dvc.org/), but before that I have some questions about using DVC with cloud storage:
Does DVC save all the different versions of the dataset?
Yes, it works on a file level. More details are in the related question "By how much can I approximately reduce disk volume by using DVC?". You can control, though, which versions to keep / save.
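To illustrate what "works on a file level" implies for disk usage, here is a minimal sketch of file-level, content-addressed storage. It is not DVC's actual code; the helper names (`content_key`, `stored_bytes`) and the tiny in-memory "datasets" are made up for illustration. The point is that an unchanged file is stored once across versions, while a file that changes even slightly is stored again in full.

```python
import hashlib

def content_key(data: bytes) -> str:
    # Hypothetical helper: the file content's hash is used as the storage key.
    return hashlib.md5(data).hexdigest()

def stored_bytes(versions):
    """Total bytes kept when every distinct file content is stored once.

    versions: list of dicts mapping file name -> file content (bytes).
    """
    store = {}
    for version in versions:
        for content in version.values():
            # Identical content maps to the same key, so it is not duplicated.
            store[content_key(content)] = len(content)
    return sum(store.values())

# Two versions of a toy dataset: train.csv changed, test.csv did not.
v1 = {"train.csv": b"a,b\n1,2\n", "test.csv": b"a,b\n3,4\n"}
v2 = {"train.csv": b"a,b\n1,2\n5,6\n", "test.csv": b"a,b\n3,4\n"}

naive = sum(len(c) for v in (v1, v2) for c in v.values())  # full copy per version
deduped = stored_bytes([v1, v2])                           # unchanged file stored once
print(naive, deduped)
```

In this toy case the changed `train.csv` is kept twice in full, which is why one huge file that changes on every version saves nothing, while a dataset split into many mostly unchanged files benefits a lot.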
Does DVC support all data file formats (CSV, Feather)?
Yes, it's format-agnostic: it doesn't matter which format you use. It also means it doesn't do anything specific to CSV. It won't try to compress it or calculate a diff in a smart way.
Can using DVC with the cloud lead to extra costs, since it increases the frequency of communication with the cloud?
I would not worry about communication costs (unless you move millions or billions of files). But saving multiple versions of a file means paying for the storage of each of those versions.
Are there limitations or disadvantages of the tool when working with large data files (100GB+)?
It has the additional cost of calculating the file hash (md5) to use as a key in its storage. If a file is large, that takes some extra time. Still, transferring those files to the cloud and back should be more expensive.
I didn't run benchmarks, but I can also imagine there are tools like s5cmd that specialize in optimizing data transfer speeds in such cases. DVC doesn't do any tricks for this at the moment.
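To get a feel for the hashing cost mentioned above, here is a rough sketch of computing an md5 over a file in fixed-size chunks, so that a 100GB+ file never has to fit in memory. This is only an illustration using Python's standard `hashlib`, not how DVC itself is implemented; `file_md5` and the 1 MiB chunk size are assumptions chosen for the example.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Small demo on a temporary ~3 MB file; a real dataset file would go here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3_000_000)
demo_path = tmp.name
demo_hash = file_md5(demo_path)
os.remove(demo_path)
print(demo_hash)
```

The whole file has to be read once to hash it, so the time grows linearly with file size. Timing `file_md5` on one of your own large files gives a quick estimate of the overhead compared with the upload itself.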