By how much can I approximately reduce disk volume by using DVC?

I want to classify ~1M+ documents and have a version control system for the input and output of the corresponding model.

The data changes over time:

  • sample size increases over time
  • new features might appear
  • the anonymization procedure might change over time

So basically "everything" might change: the number of observations, the features, and the values. We are interested in making the ML model building reproducible without using 10 to 100+ GB of disk volume, because we save all updated versions of the input data. Currently the volume of the data is ~700 MB.

The most promising tool I found is https://github.com/iterative/dvc. Currently the data is stored in a database and loaded into R/Python from there.

Question:

How much disk volume can (very approximately) be saved by using DVC?

If one can roughly estimate that at all. I tried to find out whether only the "diffs" of the data are saved, but I didn't find much information by reading through https://github.com/iterative/dvc#how-dvc-works or other documentation.

I am aware that this is a very vague question and that it will highly depend on the dataset. However, I would still be interested in getting a very approximate idea.

Identical answered 23/2, 2020 at 18:31

Let me try to summarize how DVC stores data, and I hope you'll be able to figure out from this how much space will be saved/consumed in your specific scenario.

DVC stores and deduplicates data at the individual-file level. What does this usually mean from a practical perspective?

I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache - dvc add, dvc run, etc.

Scenario 1: Modifying a file

Let's imagine I have a single 1GB XML file. I start tracking it with DVC:

$ dvc add data.xml

On a modern file system (or if hardlinks or symlinks are enabled, see this for more details), after this command we still consume only 1GB, even though the file is moved into the DVC cache and is still present in the workspace.
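As a side note, here is a minimal sketch of how the link type can be configured and how the actual cache consumption can be checked; it assumes a repository where dvc init has already been run, and the exact behaviour depends on your file system:

$ dvc config cache.type "reflink,hardlink,symlink,copy"  # preferred link types, in priority order
$ dvc checkout                                           # (re)create workspace files as links to the cache
$ du -sh .dvc/cache                                      # actual disk space consumed by DVC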

Now, let's change it a bit and save it again:

$ echo "<test/>" >> data.xml
$ dvc add data.xml

In this case we will have 2GB consumed. DVC does not compute a diff between the two versions of the same file, nor does it split files into chunks or blocks to detect that only a small portion of the data has changed.
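You can verify this yourself by checking the cache size before and after the second dvc add (a small sketch using the same data.xml example; the sizes are approximate):

$ du -sh .dvc/cache                  # ~1.0G after the first dvc add
$ dvc add data.xml                   # after appending "<test/>" as above
$ du -sh .dvc/cache                  # ~2.0G: both versions are kept in full
$ find .dvc/cache -type f | wc -l    # 2 cache entries, one per version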

To be precise, it calculates the md5 of each file and saves the file in a content-addressable key-value storage. The md5 of the file serves as the key (the path of the file in the cache) and the value is the file itself:

(.env) [ivan@ivan ~/Projects/test]$ md5 data.xml
0c12dce03223117e423606e92650192c

(.env) [ivan@ivan ~/Projects/test]$ tree .dvc/cache
.dvc/cache
└── 0c
   └── 12dce03223117e423606e92650192c

1 directory, 1 file

(.env) [ivan@ivan ~/Projects/test]$ ls -lh data.xml
data.xml ----> .dvc/cache/0c/12dce03223117e423606e92650192c (some type of link)

Scenario 2: Modifying a directory

Let's now imagine we have a single large 1GB directory, images, containing a lot of files:

$ du -hs images
1GB

$ ls -l images | wc -l
1001

$ dvc add images

At this point we still consume 1GB. Nothing has changed. But if we modify the directory by adding more files (or removing some of them):

$ cp /tmp/new-image.png images

$ ls -l images | wc -l
1002

$ dvc add images

In this case, after saving the new version, we are still close to 1GB of consumption. DVC calculates the diff at the directory level: it won't save again the files that already existed in the directory.
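Under the hood (this is an implementation detail, and the hashes below are purely illustrative), the directory itself is stored in the cache as a small .dir manifest that lists every file by its own md5; files whose content did not change keep pointing to their existing cache entries, so only the new image is added:

$ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir   # illustrative hash
[{"md5": "dff70c0392d7d386c39a23c64fcc0376", "relpath": "new-image.png"},
 {"md5": "2fa48a2652a687e1babf4ca56d7dbbf0", "relpath": "image-0001.png"},
 ...]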

Again, the same logic applies to all commands that save data files or directories into the DVC cache - dvc add, dvc run, etc.

Please let me know if this is clear or if we need to add more details or clarifications.

Athanor answered 23/2, 2020 at 19:57 Comment(4)
Thanks a lot for tolerating the vagueness in my question and for the very detailed answer! Very interesting indeed, as initially we a) collect 1M+ pairs of EML file + metadata file, b) store this in a database and c) load it in R/Python. So I would summarize that for a) I can calculate the diffs very well with DVC, for b) hardly, and if I make changes after c) I would have to ask our architects/engineers whether I can write back to a) in order to use the full potential of DVC, right? (assuming the downside of the resulting circular data flow would be tolerable, of course...)Identical
@Identical how do you load the data into R/Python during step c)? Do you run some ad-hoc queries? Or do you just filter a predefined set of records?Athanor
It is basically "SELECT * FROM X" every other week, so the number of observations or columns is not predefined. (Hope I understood the question correctly.)Identical
Yes, thanks for the clarification! My intuition is that in your case it would make sense to add an extra step that runs this SQL with dvc run and saves the result into a file. This file is then parsed by your R/Python scripts. You can also dvc add the incoming data files as well (to be able to restore the database state in case you need it). My point here is that it looks like you have two different processes, ETL and then the actual analysis/model building, and I would first decide which one should be reproducible.Athanor
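A hypothetical sketch of such a stage (export.py and data.csv are made-up names for illustration; export.py is assumed to run the SELECT and write its result to data.csv, and the exact dvc run flags may differ between DVC versions):

$ dvc run -n export-data -d export.py -o data.csv python export.py
$ dvc repro export-data    # re-run the export and update the tracked output when needed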
