How to add a file to a dvc-tracked folder without pulling the whole folder's content?

Let's say I am working inside a Git/DVC repo. There is a folder data containing 100k small files. I track it with DVC as a single unit, as recommended by the docs:

dvc add data

I also do this because, in my experience, DVC is fairly slow when tracking that many files one by one.

I clone the repo into another workspace, so I now have the data.dvc file locally but none of the actual files yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally?

What I have tried so far:

  1. Adding the data folder again:

    mkdir data
    mv path/to/newfile.txt data/newfile.txt
    dvc add data
    

    The data.dvc file is rebuilt from the local state of data, which only contains newfile.txt, so this doesn't work.

  2. Adding the file as a single element in data folder:

     dvc add data/newfile.txt
    

    I get:

     Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'. 
     To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'
    
  3. Using dvc commit as suggested:

     mkdir data
     mv path/to/newfile.txt data/newfile.txt
     dvc commit data.dvc
    

    As in attempt 1, data.dvc is rebuilt from the local state of data, so it only lists newfile.txt.

Urquhart answered 6/5, 2021 at 15:25 Comment(1)
Re "DVC is kinda slow when tracking that many files one by one": the limitation is more in the file system. DVC creates a .dvc metafile for each file or directory you want to track as an entity. Adding 100k files separately therefore means managing 100k metafiles, which involves a lot of I/O operations. That's why DVC allows granularity in most of its commands even if you track entire directories, e.g. you can dvc add data; dvc push data/some/file (comment by Defalcate)

I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet (haven't dvc pulled). I want to add a file to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally?

Interesting question. I don't think there is an easy way to do this at the moment. On the other machine, if you run dvc add data again with only one file in the folder, DVC will assume you deleted all the other files, create a new cached version of the data directory (containing only the new file), and update the .dvc file accordingly, as you discovered.
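
To make that behaviour concrete, here is a toy model in Python. It is only a sketch under assumptions: it treats a tracked directory as a list of (md5, relpath) entries, which is similar in spirit to how DVC describes a directory, but it does not reproduce DVC's real file format or hashing. The names build_manifest, merge_entry and manifest_hash are hypothetical helpers written for this illustration, not DVC functions.

```python
import hashlib
import json


def file_md5(content):
    """Hash one file's bytes."""
    return hashlib.md5(content).hexdigest()


def build_manifest(files):
    """Build a directory listing from whatever is present locally.

    `files` maps a relative path to that file's bytes (toy stand-in for disk).
    """
    entries = [{"md5": file_md5(data), "relpath": path} for path, data in files.items()]
    return sorted(entries, key=lambda entry: entry["relpath"])


def merge_entry(manifest, relpath, content):
    """Hypothetical alternative: add one entry to an existing listing.

    Only the new file's bytes are needed; other entries are reused as-is.
    """
    kept = [entry for entry in manifest if entry["relpath"] != relpath]
    kept.append({"md5": file_md5(content), "relpath": relpath})
    return sorted(kept, key=lambda entry: entry["relpath"])


def manifest_hash(manifest):
    """Toy identifier for a whole directory version (not DVC's exact scheme)."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest() + ".dir"


# Listing recorded in the first workspace, where all files were present.
original = build_manifest({"a.txt": b"alpha", "b.txt": b"beta"})

# Fresh clone without a pull: only the new file exists locally, so a rebuild
# from local state "forgets" every other file.
rebuilt = build_manifest({"newfile.txt": b"new"})

# What the question is really asking for: extend the recorded listing.
merged = merge_entry(original, "newfile.txt", b"new")

print([entry["relpath"] for entry in rebuilt])
print([entry["relpath"] for entry in merged])
```

In this model the rebuilt listing contains only newfile.txt, while the merged one keeps the existing entries and adds the new one without needing the contents of the other files. The three attempts in the question all correspond to the rebuild path.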

You could open a feature request at https://github.com/iterative/dvc.org/issues.

Doorstone answered 7/5, 2021 at 1:32 Comment(0)
