Difference between git-lfs and dvc
C

2

36

What is the difference between these two? We used git-lfs at my previous job, and we are starting to use dvc alongside git at my current one. Both put some kind of pointer in the repository in place of the actual file, and the file can then be downloaded on demand. Does dvc offer any improvements over git-lfs?

Charleencharlemagne answered 24/10, 2019 at 12:19 Comment(0)
F
13

DVC is a better replacement for git-lfs.

Unlike git-lfs, DVC doesn't require installing a dedicated server; it can be used on-premises (NAS or SSH, for example) or with any major cloud provider (S3, Google Cloud, Azure).

For more information: https://dvc.org/doc/use-cases/data-and-model-files-versioning

Fuqua answered 24/10, 2019 at 13:54 Comment(6)
Yep! In fact there's a section in the DVC docs explaining these differences: dvc.org/doc/understanding-dvc/… - Byrnes
More information is here: discuss.dvc.org/t/… towardsdatascience.com/… - Scathe
@JorgeOrpinel's link has moved to dvc.org/doc/user-guide/… - Alcine
It moved to dvc.org/doc/use-cases/versioning-data-and-model-files actually (but the old one redirects so it's OK). That other link is also relevant though, Sam, thanks! - Byrnes
This explanation seems one-sided. You provided one advantage, but knowing that git is widely supported, would I run into issues trying to get DVC to work in my favourite IDE? Would I need the whole team to install additional software to get DVC to work? Will my team/contributors have to learn a new API or workflow? These may seem like minor concerns when looking at the technical side of the tool, but they may weigh heavily when making a decision in a large organization. - Deidradeidre
I think dvc can cover the lfs use case: the steps for using dvc are almost the same as for lfs. The only extra steps I see are installing dvc and adding remote storage. With these extra steps also come extra benefits. - Antakiya
V
50

DVC is not better than git-lfs: they are quite different. The selected answer is largely biased. Both are simply different tools, for different purposes.

  • git-lfs is intended to be transparent to git, and for that reason it requires a server that supports it. Its learning curve is short: a few configuration commands and it is running, storing large files separately from the git repository. That is its only function, and it does it well. Needing an additional server is not a drawback but a requirement for that transparency. Once configured, files are simply handled by git, by means of git filters and hooks (scripts that git runs around operations such as checkout and push).
  • dvc is intended to give the end user independent management of large files. What dvc basically does is this: it makes git ignore the files you want it to control (by adding them to .gitignore) and generates, for each one, an additional small file with the same name plus the extension .dvc. So, to push a commit together with its corresponding files, the user has to manually "add" (equivalent to git commit, not to git add; dvc has no equivalent of the git staging area) and "push" in both systems. This is not a drawback, but a necessary level of control. In exchange, the remote that holds the large files can be just any remote filesystem, reachable directly by its path, via ssh, or via one of several drivers (Google Drive, Amazon, etc.). Hooks are also available for dvc, and they simplify working with large files, provided the additional .dvc files do not bother you. Keep in mind that saving files to the remote still requires additional operations, because the files are .gitignored. So if you modify a file tracked by dvc, git status will not notice the change, and you might lose it unless you make the additional check with dvc.
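To make the two bullets concrete, here is a toy Python sketch of the "pointer instead of file" idea. It is not the implementation of either tool. `lfs_pointer` builds the small text stub that git-lfs commits in place of the real content, following the pointer layout of the Git LFS spec. `toy_add` and `toy_is_modified` are hypothetical helpers that mimic what `dvc add` and a dvc status check roughly do; real .dvc files are YAML, and the JSON sidecar and cache layout used here are simplifying assumptions.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def lfs_pointer(data: bytes) -> str:
    """Text that git-lfs commits to git in place of the real file content."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


def toy_add(path: Path, cache_dir: Path) -> Path:
    """Toy stand-in for `dvc add` (hypothetical helper, not DVC's code)."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    # Keep a content-addressed copy of the large file outside of git.
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, cache_dir / digest)
    # Hide the large file from git ...
    with (path.parent / ".gitignore").open("a") as fh:
        fh.write(f"/{path.name}\n")
    # ... and write the small sidecar file that git tracks instead.
    sidecar = path.with_name(path.name + ".dvc")
    sidecar.write_text(json.dumps({"md5": digest, "path": path.name}))
    return sidecar


def toy_is_modified(path: Path) -> bool:
    """Compare the file on disk with the hash recorded in its sidecar."""
    sidecar = path.with_name(path.name + ".dvc")
    recorded = json.loads(sidecar.read_text())["md5"]
    return hashlib.md5(path.read_bytes()).hexdigest() != recorded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.bin"
        model.write_bytes(b"weights v1")
        print(lfs_pointer(model.read_bytes()))
        toy_add(model, Path(tmp) / "cache")
        model.write_bytes(b"weights v2")
        # git would see nothing here: model.bin is ignored and the
        # sidecar is unchanged. Only the explicit check notices the edit.
        print("modified:", toy_is_modified(model))
```

The last step is the pitfall described above: once the file is .gitignored, an edit only shows up if you ask the second tool, because the sidecar that git tracks has not changed.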

DVC has a different purpose than git-lfs. DVC is used not only to store large files, but mainly to manage large files that are the result of deterministic processes. So, in addition to storing large files, dvc also controls processing pipelines, much like make does: dependencies are declared per stage (in a dvc.yaml file in current DVC versions), and if the processing inputs (which are also files or parameters tracked by dvc) change, dvc calculates which files must be regenerated (yes, like Makefiles). That's why DVC is often described as a Makefile tool for data science. That's useful if you are generating big AI models or heavy data files in large quantities. It is the exact equivalent of compiling a large application: each localized change only requires recompiling a small portion of the whole.
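The Makefile analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not how DVC is implemented: the `Stage` class, the `stale_stages` function and the file names are all made up for the example, and the stages are assumed to be listed in dependency order.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline step: it reads `deps` and produces `outs` (hypothetical)."""
    name: str
    deps: list[str]
    outs: list[str]


def stale_stages(stages: list[Stage], old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the stages that must rerun, given old and new input hashes."""
    # Inputs whose recorded hash differs from the current one.
    changed = {path for path, digest in new.items() if old.get(path) != digest}
    stale = []
    for stage in stages:  # assumed to be listed in dependency order
        if changed.intersection(stage.deps):
            stale.append(stage.name)
            # Rerunning a stage changes its outputs, which may
            # invalidate the stages that depend on them.
            changed.update(stage.outs)
    return stale


PIPELINE = [
    Stage("prepare", deps=["raw.csv"], outs=["clean.csv"]),
    Stage("train", deps=["clean.csv", "params.yaml"], outs=["model.pkl"]),
    Stage("report", deps=["notes.md"], outs=["report.html"]),
]

if __name__ == "__main__":
    before = {"raw.csv": "aaa", "params.yaml": "bbb", "notes.md": "ccc"}
    after = dict(before, **{"params.yaml": "b2b"})
    print(stale_stages(PIPELINE, before, after))
```

Changing only `params.yaml` marks just the `train` stage as stale, while changing `raw.csv` would mark `prepare` and then `train`. The `report` stage is untouched in both cases, which is the "recompile only a small portion" behaviour described above.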

Personally, I use both for large-file storage. git-lfs simplifies large-file management (typical case: building an AI Docker container with a large model file inside while keeping the git repo small, with almost no extra git knowledge, whereas dvc requires learning some of its own commands). dvc, in turn, simplifies administration of the large-file storage: for example, I can easily find and delete a file that I no longer want in the DVC storage, which is impossible or at least complex with git-lfs. The cost is the lack of that transparency, and I have sometimes lost data because of it. I still don't use dvc for pipelines; so far I have preferred my own implementations. DVC is getting better, and perhaps I will use it more in the future. Both are just different; I currently use both, depending on the purpose.

Valenta answered 24/3, 2021 at 8:15 Comment(0)
