Is it possible to check that the version of a file tracked by a DVC metadata file exists in remote storage without pulling the file?

Asked 28/5, 2021 at 20:10 Answered 8/8, 2023 at 9:44

Solved git gitlab continuous-integration dvc

My team has a set up wherein we track datasets and models in DVC, and have a GitLab repository for tracking our code and DVC metadata files. We have a job in our dev GitLab pipeline (run on each push to a merge request) that has the goal of checking to be sure that the developer remembered to run dvc push to keep DVC remote storage up-to-date. Right now, the way we do this is by running dvc pull on the GitLab runner, which will fail with errors telling you which files (new files or latest versions of existing files) were not found.

The downside to this approach is that we are loading the entirety of our data stored in DVC onto a GitLab runner, and we've run into out-of-memory issues, not to mention lengthy run time to download all that data. Since the path and md5 hash of the objects are stored in the DVC metadata files, I would think that's all the information that DVC would need to be able to answer the question "is the remote storage system up-to-date".

It seems like dvc status is similar to what I'm asking for, but compares the cache or workspace and remote storage. In other words, it requires the files to actually be present on whatever filesystem is making the call.

Is there some way to achieve the goal I laid out above ("inform the developer that they need to run dvc push") without pulling everything from DVC?

Dorathydorca answered 28/5, 2021 at 20:10 Comment(0)

It seems like dvc status is similar to what I'm asking for

dvc status --cloud will give you a list of "new" files if they that haven't been pushed to the (default) remote. It won't error out though, so your CI script should fail depending on the stdout message.

More info: https://dvc.org/doc/command-reference/status#options

I'd also ask everyone to run dvc install, which will setup some Git hooks, including automatic dvc push with git push.

See https://dvc.org/doc/command-reference/install

Gryphon answered 29/5, 2021 at 3:9 Comment(1)

Thanks, Jorge. The dvc install suggestion is particularly interesting. – Dorathydorca 3/6, 2021 at 14:7

Following Jorge Orpinel Pérez answer:

dvc status --cloud will give you a list of "new" files if they that haven't been pushed to the (default) remote. It won't error out though, so your CI script should fail depending on the stdout message.

You can use dvc status --cloud -q

-q, --quiet - do not write anything to standard output. Exit with 0 if data and pipelines are up to date, otherwise 1.

Perreira answered 8/8, 2023 at 9:44 Comment(0)

Recommended topics

Hot tags