By how much can I approximately reduce disk volume by using DVC?

I want to classify ~1M+ documents and have a version control system for the input and output of the corresponding model.

The data changes over time:

  • sample size increases over time
  • new features might appear
  • the anonymization procedure might change over time

So basically "everything" might change: the number of observations, the features, and the values. We are interested in making the ML model building reproducible without using 10 to 100+ GB of disk volume, because we save all updated versions of the input data. Currently the volume of the data is ~700 MB.

The most promising tool I found is https://github.com/iterative/dvc. Currently the data is stored in a database and loaded into R/Python from there.

Question:

How much disk volume can (very approximately) be saved by using DVC?

If one can roughly estimate that at all. I tried to find out whether only the "diffs" of the data are saved, but I didn't find much information by reading through https://github.com/iterative/dvc#how-dvc-works or other documentation.

I am aware that this is a very vague question and that it will highly depend on the dataset. However, I would still be interested in getting a very approximate idea.

Identical answered 23/2, 2020 at 18:31

Let me try to summarize how DVC stores data, and I hope you'll be able to figure out from this how much space will be saved/consumed in your specific scenario.

DVC stores and deduplicates data at the individual-file level. What does this usually mean from a practical perspective?

I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache - dvc add, dvc run, etc.

Scenario 1: Modifying a file

Let's imagine I have a single 1GB XML file. I start tracking it with DVC:

$ dvc add data.xml

On a modern file system (or if hardlinks or symlinks are enabled, see this for more details), after this command we still consume only 1GB, even though the file is moved into the DVC cache and is still present in the workspace.
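As a side note, here is a minimal sketch of how the link type can be configured and how the actual cache consumption can be checked; it assumes a repository where dvc init has already been run, and the exact behaviour depends on your file system:

$ dvc config cache.type "reflink,hardlink,symlink,copy"  # preferred link types, in priority order
$ dvc checkout                                           # (re)create workspace files as links to the cache
$ du -sh .dvc/cache                                      # actual disk space consumed by DVC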

Now, let's change it a bit and save it again:

$ echo "<test/>" >> data.xml
$ dvc add data.xml

In this case we will have 2GB consumed. DVC does not compute a diff between the two versions of the same file, nor does it split files into chunks or blocks to detect that only a small portion of the data has changed.
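You can verify this yourself by checking the cache size before and after the second dvc add (a small sketch using the same data.xml example; the sizes are approximate):

$ du -sh .dvc/cache                  # ~1.0G after the first dvc add
$ dvc add data.xml                   # after appending "<test/>" as above
$ du -sh .dvc/cache                  # ~2.0G: both versions are kept in full
$ find .dvc/cache -type f | wc -l    # 2 cache entries, one per version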

To be precise, it calculates the md5 of each file and saves the file in a content-addressable key-value storage. The md5 of the file serves as the key (the path of the file in the cache) and the value is the file itself:

(.env) [ivan@ivan ~/Projects/test]$ md5 data.xml
0c12dce03223117e423606e92650192c

(.env) [ivan@ivan ~/Projects/test]$ tree .dvc/cache
.dvc/cache
└── 0c
   └── 12dce03223117e423606e92650192c

1 directory, 1 file

(.env) [ivan@ivan ~/Projects/test]$ ls -lh data.xml
data.xml ----> .dvc/cache/0c/12dce03223117e423606e92650192c (some type of link)

Scenario 2: Modifying a directory

Let's now imagine we have a single large 1GB directory, images, containing a lot of files:

$ du -hs images
1GB

$ ls -l images | wc -l
1001

$ dvc add images

At this point we still consume 1GB. Nothing has changed. But if we modify the directory by adding more files (or removing some of them):

$ cp /tmp/new-image.png images

$ ls -l images | wc -l
1002

$ dvc add images

In this case, after saving the new version, we are still close to 1GB of consumption. DVC calculates the diff at the directory level: it won't save again the files that already existed in the directory.
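Under the hood (this is an implementation detail, and the hashes below are purely illustrative), the directory itself is stored in the cache as a small .dir manifest that lists every file by its own md5; files whose content did not change keep pointing to their existing cache entries, so only the new image is added:

$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir   # illustrative hash
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "new-image.png"},
 {"md5": "2fa48a2652a687e1babf4ca56d7dbbf0", "relpath": "image-0001.png"},
 ...]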

Again, the same logic applies to all commands that save data files or directories into the DVC cache - dvc add, dvc run, etc.

Please let me know if this is clear or if we need to add more details or clarifications.

Athanor answered 23/2, 2020 at 19:57 Comment(4)
Thanks a lot for tolerating the vagueness in my question and for the very detailed answer! Very interesting indeed, as initially we a) collect 1M+ pairs of EML file + metadata file, b) store this in a database and c) load it in R/Python. So I would summarize that for a) I can calculate the diffs very well with DVC, for b) hardly, and if I make changes after c) I would have to ask our architects/engineers whether I can write back to a) in order to use the full potential of DVC, right? (assuming the downside of the resulting circular data flow would be tolerable, of course...)Identical
@Identical how do you load the data into R/Python during step c)? Do you run some ad-hoc queries? Or do you just filter a predefined set of records?Athanor
It is basically "SELECT * FROM X" every other week, so the number of observations or columns is not predefined. (Hope I understood the question correctly.)Identical
Yes, thanks for the clarification! My intuition is that in your case it would make sense to add an extra step that runs this SQL with dvc run and saves the result into a file. This file is then parsed by your R/Python scripts. You can also dvc add the incoming data files as well (to be able to restore the database state in case you need it). My point here is that it looks like you have two different processes, ETL and then the actual analysis/model building, and I would first decide which one should be reproducible.Athanor
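A hypothetical sketch of such a stage (export.py and data.csv are made-up names for illustration; export.py is assumed to run the SELECT and write its result to data.csv, and the exact dvc run flags may differ between DVC versions):

$ dvc run -n export-data -d export.py -o data.csv python export.py
$ dvc repro export-data    # re-run the export and update the tracked output when needed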
