Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element:

    dvc add data

I do this because it is recommended by the doc and because, in my experience, DVC is kinda slow when tracking that many files one by one.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally?
What I have tried so far:

1. Adding the data folder again:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc add data

   The data.dvc file is built again from the local state of data, which only contains newfile.txt, so this doesn't work.

2. Adding the file as a single element in the data folder:

    dvc add data/newfile.txt

   I get:

    Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'. To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'

3. Using dvc commit as suggested:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc commit data.dvc

   As in 1., the data.dvc file is rebuilt from the local state of data.
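
To make the failure of the `dvc add data` and `dvc commit data.dvc` attempts above more concrete, here is a toy model in Python. It is not DVC's actual code: `build_manifest` and `demo` are hypothetical names, and the only assumption is that a directory snapshot is rebuilt purely from the files present locally. Three files stand in for the 100k.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(directory: Path) -> list[dict[str, str]]:
    """Toy snapshot of a directory: one entry per file found locally.

    Hypothetical helper, only meant to mimic "rebuild from local state".
    """
    entries = []
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            relpath = path.relative_to(directory).as_posix()
            entries.append({"md5": digest, "relpath": relpath})
    return entries


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        # Original workspace: data/ holds the full dataset.
        original = Path(tmp) / "original" / "data"
        original.mkdir(parents=True)
        for i in range(3):
            (original / f"file{i}.txt").write_text(f"content {i}")
        before = build_manifest(original)

        # Fresh clone: data/ was recreated by hand and holds only the new file.
        clone = Path(tmp) / "clone" / "data"
        clone.mkdir(parents=True)
        (clone / "newfile.txt").write_text("new content")
        after = build_manifest(clone)

        return (
            [entry["relpath"] for entry in before],
            [entry["relpath"] for entry in after],
        )


if __name__ == "__main__":
    before, after = demo()
    print("tracked before:", before)
    print("tracked after re-adding from the clone:", after)
```

In this sketch the second snapshot lists only newfile.txt, which matches what I observe with data.dvc: the files that were never pulled are simply absent from the rebuilt snapshot.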
"DVC is kinda slow when tracking that many files one by one."
The limitation is more in the file system. The way DVC works is that it creates a .dvc metafile for each file/dir you want to track as an entity. Adding 100k files separately then requires managing 100k metafiles, which involves lots of I/O operations. That's why DVC allows granularity in most of its commands even if you track entire directories, e.g. you can dvc add data; dvc push data/some/file
– Defalcate