Let me try to summarize how DVC stores data, and I hope you'll be able to figure out from this how much space will be saved or consumed in your specific scenario.
DVC stores and deduplicates data at the individual file level. So, what does that usually mean from a practical perspective?
I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.
Scenario 1: Modifying a file
Let's imagine I have a single 1GB XML file. I start tracking it with DVC:
$ dvc add data.xml
On a modern file system (or if hardlinks or symlinks are enabled, see this for more details), we still consume 1GB after this command, even though the file has been moved into the DVC cache and is still present in the workspace.
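To illustrate why a link keeps the consumption at 1GB, here is a minimal Python sketch. It is not DVC's actual code: the file names are made up and it only demonstrates the hardlink case, assuming the file system supports hardlinks. Both paths point at the same data on disk, so the content is stored once.

```python
import os
import tempfile
from pathlib import Path


def demo():
    """Create a 'cached' file, hardlink it into a 'workspace', and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical paths, chosen only for this illustration.
        cached = root / "cache" / "cached-file"
        workspace_file = root / "workspace" / "data.xml"
        cached.parent.mkdir(parents=True)
        workspace_file.parent.mkdir(parents=True)

        cached.write_bytes(b"<root/>\n" * 1000)
        # A hardlink is a second name for the same data blocks.
        os.link(cached, workspace_file)

        cached_stat = cached.stat()
        workspace_stat = workspace_file.stat()
        same_inode = cached_stat.st_ino == workspace_stat.st_ino
        return same_inode, workspace_stat.st_nlink


if __name__ == "__main__":
    print(demo())
```

The two names share one inode with a link count of 2, which is why the file appears both in the cache and in the workspace without doubling the space used.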
Now, let's change it a bit and save it again:
$ echo "<test/>" >> data.xml
$ dvc add data.xml
In this case we will have 2GB consumed. DVC does not compute a diff between two versions of the same file, nor does it split files into chunks or blocks to detect that only a small portion of the data has changed.
To be precise, it calculates the md5 of each file and saves the file in a content-addressable key-value storage. The md5 of the file serves as the key (the path of the file in the cache) and the value is the file itself:
(.env) [ivan@ivan ~/Projects/test]$ md5 data.xml
0c12dce03223117e423606e92650192c
(.env) [ivan@ivan ~/Projects/test]$ tree .dvc/cache
.dvc/cache
└── 0c
    └── 12dce03223117e423606e92650192c
1 directory, 1 file
(.env) [ivan@ivan ~/Projects/test]$ ls -lh data.xml
data.xml ----> .dvc/cache/0c/12dce03223117e423606e92650192c (some type of link)
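The same idea can be sketched in a few lines of Python. This is only an illustration of md5-keyed, content-addressable storage, not DVC's implementation: the helper names (file_md5, save_to_cache, cache_size) are mine, and files are copied rather than linked to keep the sketch simple. It shows that saving an unchanged file adds nothing to the cache, while appending a few bytes stores a second full copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex md5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_to_cache(path: Path, cache_dir: Path) -> Path:
    """Store a file under <first 2 hex chars>/<remaining 30 hex chars>."""
    checksum = file_md5(path)
    target = cache_dir / checksum[:2] / checksum[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return target


def cache_size(cache_dir: Path) -> int:
    """Total size in bytes of all files in the cache."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "data.xml"
        data.write_bytes(b"<root/>\n" * 1000)  # 8000 bytes

        save_to_cache(data, cache)
        first = cache_size(cache)

        save_to_cache(data, cache)  # unchanged file: same key, nothing added
        unchanged = cache_size(cache)

        with data.open("ab") as f:  # small change, as with `echo ... >>`
            f.write(b"<test/>\n")
        save_to_cache(data, cache)  # new md5, so a full second copy is stored
        modified = cache_size(cache)
        return first, unchanged, modified


if __name__ == "__main__":
    print(demo())
```

Scaled up to the 1GB file above, the same thing happens: the key changes completely after the small edit, so the cache ends up holding both versions in full.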
Scenario 2: Modifying a directory
Let's now imagine we have a single large 1GB directory, images, with a lot of files:
$ du -hs images
1GB
$ ls -l images | wc -l
1001
$ dvc add images
At this point we still consume 1GB. Nothing has changed. But if we modify the directory by adding more files (or removing some of them):
$ cp /tmp/new-image.png images
$ ls -l images | wc -l
1002
$ dvc add images
In this case, after saving the new version, we are still close to 1GB of consumption. DVC calculates the diff at the directory level: files that already existed in the directory are not saved again, only the new or changed ones are added to the cache.
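Here is a rough Python sketch of that per-file deduplication for a directory. Again, this is an assumption-laden illustration rather than DVC's code: add_directory is a made-up helper, the files are tiny stand-ins for images, and it leaves out any bookkeeping of which files belong to which version of the directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_directory(directory: Path, cache_dir: Path) -> int:
    """Cache every file in `directory`; return how many were newly stored."""
    newly_stored = 0
    for path in sorted(directory.rglob("*")):
        if not path.is_file():
            continue
        checksum = hashlib.md5(path.read_bytes()).hexdigest()
        target = cache_dir / checksum[:2] / checksum[2:]
        if target.exists():
            continue  # this content is already in the cache
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
        newly_stored += 1
    return newly_stored


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        images = root / "images"
        images.mkdir()
        # 1000 small files with distinct content stand in for the images.
        for i in range(1000):
            (images / f"{i}.png").write_bytes(f"image {i}".encode())

        first_add = add_directory(images, cache)

        # Add one more file and save the directory again.
        (images / "new-image.png").write_bytes(b"a new image")
        second_add = add_directory(images, cache)
        return first_add, second_add


if __name__ == "__main__":
    print(demo())
```

The first call stores all 1000 files, the second stores only the one new file, which is why the total stays close to the original size.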
The same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.
Please let me know if this is clear or if we need to add more details or clarifications.