How does git LFS track and store binary data more efficiently than git?

I know that git LFS causes git to store a text "pointer" file in the repo, and then git LFS downloads the target binary file from a separate server. In this way, git repos are smaller on the remote git server. But git LFS still has to store the binary files somewhere, so it seems to me that the local storage (after a git lfs pull) is no different, and that the combined sum of the remote git LFS server data plus the remote git data would still be about the same as a plain git repo holding everything itself.

What am I missing? How does git LFS efficiently track binary files?
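
For reference, here is roughly what one of those pointer files looks like, following the Git LFS pointer-file format; the hash and size below are made-up, illustrative values:

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345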


Update (after additional learning since writing this question): don't use git lfs. I now recommend against using it.

See also:

  1. my comments below the answer I accepted
  2. my own answer I just added below

I began with this question because I believed Git LFS was amazing and wonderful, and I wanted to know how it worked. Instead, I ended up realizing Git LFS was the cause of my daily workflow problems, and that I shouldn't use it nor recommend it anymore.

Summary:

As I state here:

For personal, free GitHub accounts, it is way too limiting, and for paid, corporate accounts, it makes git checkout go from taking a few seconds to up to 3+ hours, especially for remote workers, which is a total waste of their time. I dealt with that for three years and it was horrible. I wrote a script to do a git lfs fetch once per night to mitigate this, but my employer refused to buy me a bigger SSD to give me enough space to do git lfs fetch --all once per night, so I still ran into the multi-hour-checkout problem frequently. It's also impossible to undo the integration of git lfs into your repo unless you delete your whole GitHub repo and recreate it from scratch.

Details:

I just discovered that the free version of git lfs has such strict limits that it's useless, and I'm now in the process of removing it from all my public free repos. See this answer (Repository size limits for GitHub.com) and search for the "git lfs" parts.

It seems to me that the only benefit of git lfs is that it avoids downloading a ton of data all at once when you clone a repo. That's it! That seems like a pretty minimal, if not useless, benefit for any repo which has a total content size (git repo + would-be git lfs repo) < 2 TB or so. All that using git lfs does is

  1. make git checkout take forever (literally hours) (bad)
  2. make my normally fast, offline git commands, like git checkout, become slow, online commands (bad), and
  3. act as another GitHub service to pay for (bad).

If you're trying to use git lfs to overcome GitHub's 100 MB max file size limit, like I was, don't! You'll run out of git lfs space almost instantly, in particular if anyone clones or forks your repo, as that counts against your limits, not theirs! Instead, "a tool such as tar plus split, or just split alone, can be used to split a large file into smaller parts, such as 90 MB each" (source), so that you can then commit those binary file chunks to your regular git repo.
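
As a concrete sketch of that split-and-reassemble approach (the file and directory names here are placeholders):

# Tar and split a large file into 90 MB chunks named
# my_data.tar.gz.part_aa, my_data.tar.gz.part_ab, etc.
tar -czf my_data.tar.gz my_data_dir
split -b 90M my_data.tar.gz my_data.tar.gz.part_

# Commit the chunks to your regular git repo instead of the original large file
git add my_data.tar.gz.part_*

# Later, reassemble the original file from the chunks (the glob expands in
# lexical order, so the pieces concatenate back together in the right order)
cat my_data.tar.gz.part_* > my_data.tar.gz
tar -xzf my_data.tar.gz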

Lastly, the "solution" on GitHub to stop using git lfs and totally free up that space again is absolutely crazy nuts! You have to delete your entire repo! See this Q&A here: How to delete a file tracked by git-lfs and release the storage quota?

GitHub's official documentation confirms this (emphasis added):

After you remove files from Git LFS, the Git LFS objects still exist on the remote storage and will continue to count toward your Git LFS storage quota.

To remove Git LFS objects from a repository, delete and recreate the repository. When you delete a repository, any associated issues, stars, and forks are also deleted.

I can't believe this is even considered a "solution". I really hope they're working on a better fix for it.

Suggestion to employers and corporations considering using git lfs:

Quick summary: don't use git lfs. Buy your employees bigger SSDs instead. If you do end up using git lfs, buy your employees bigger SSDs anyway, so they can run a script to do git lfs fetch --all once per night while they are sleeping.

Details:

Let's say you're a tech company with a massive mono-repo that is 50 GB in size, plus 4 TB of binary files and data that you'd like to be part of the repo. Rather than giving your employees insufficient 500 GB ~ 2 TB SSDs and then resorting to git lfs, which makes git checkouts go from seconds to hours when done over home internet connections, get them bigger solid state drives instead! A typical tech employee costs you > $1000/day (5 working days per week x 48 working weeks per year x $1000/day = $240k/year, which is less than their salary + benefits + overhead costs). So, a $1000 8 TB SSD is totally worth it if it saves them hours of waiting and hassle! Examples to buy:

  1. 8TB Sabrent Rocket M.2 SSD, $1100
  2. 8TB Inland M.2 SSD, $900

Now they will hopefully have enough space to run git lfs fetch --all in an automated nightly script to fetch LFS contents for all remote branches to help mitigate (but not solve) this, or at least git lfs fetch origin branch1 branch2 branch3 to fetch the contents for the hashes of their most-used branches.
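
As a rough sketch, that nightly job could be a cron entry like one of these (the repo path, branch names, and log file are placeholders):

# Nightly at 3 AM: pre-fetch all Git LFS data for all remote branches
0 3 * * *  cd /path/to/repo && git lfs fetch --all >> /tmp/lfs_fetch.log 2>&1

# Or, if `--all` won't fit on your SSD, pre-fetch just your most-used branches
0 3 * * *  cd /path/to/repo && git lfs fetch origin main branch1 branch2 >> /tmp/lfs_fetch.log 2>&1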

See also

  1. Really insightful Q&A which also leans towards not using git lfs [even for remote repos]: Do I need Git LFS for local repos?
  2. What is the advantage of git lfs?
  3. My Q&A: How to resume git lfs post-checkout hook after failed git checkout
  4. My answer: How to shrink your .git folder in your git repo
  5. My Q&A: What is the difference between git lfs fetch, git lfs fetch --all, and git lfs pull?

Atkins answered 6/4, 2023 at 6:16

Comments:
Please edit your answer to make it clear you are referring only to GitHub's implementation of git lfs and not git lfs in general. Hosting your own GitLab instance could be an elegant solution to this problem. -- Heiser

@LaviArzi, while self-hosting git lfs might solve the GitHub space limitation issue, it wouldn't solve the "git checkout takes forever" issue that anyone separated from the remote server (ex: all remote employees) would still see. So, I'm not talking about only GitHub's implementation. I'm talking about Git LFS in general. -- Atkins

Sorry for the misunderstanding then. But isn't the issue you're talking about relevant only if you need the ability to go back to previous versions on a whim? If all I'm doing is regular collaborative work, things should be fine in that case: fetching and checking out whenever a collaborator makes a change, and pushing whenever I make a change. -- Heiser

@LaviArzi, no, it's an issue even in normal workflows. I used git lfs for 3 years in a 1200-developer org in a mono-repo that was around 200 GB, with 100 GB of that in git lfs, and every single flippin' week, if not day, simply doing git fetch and git checkout main, or git checkout my_branch_from_yesterday, or similar, would take up to 3 hours for the checkout alone, since git lfs adds hooks to pull git lfs data when you do git checkout. This is because someone on the AI perception team would add a bunch of camera data or something to git lfs, and my checkout would download it. -- Atkins

I'd rather have a 4 TB SSD with a 2 TB repo all local, that pulls nightly, and 30-second git checkouts, than a 1 TB SSD, which is what I was allotted, with a 200 GB repo and 700 GB of build data, that takes 3 hours every day when I need to change branches to look at something (via a normally-benign git checkout). -- Atkins

"It seems to me that the only benefit of git lfs is that it avoids downloading a ton of data all at once when you clone a repo." The more serious issue that git lfs circumvents is the repo growing very large when there are frequent changes to binary files. If you only add some blobs once and never change them, then it's not really a big deal. Accumulating the history of changes to such files is what can cause big problems. -- Soidisant

Answer:

When you clone a Git repository, you have to download a compressed copy of its entire history. Every version of every file is accessible to you.

With Git LFS, the file data are not stored in the repository, so when you clone the repository it does not have to download the complete history of the files stored in LFS. Only the "current" version of each LFS file is downloaded from the LFS server. Technically, LFS files are downloaded during "checkout" rather than "clone."

So Git LFS is not so much about storing large files efficiently as it is about avoiding downloading unneeded versions of selected files. That history is often not very interesting anyway, and if you need an older version, Git can connect to the LFS server and get it. This is in contrast to regular Git, which lets you check out any commit offline.
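
As an aside, if you want this lazy-download behavior but more control over when the download happens, Git LFS lets you skip the download at clone/checkout time and pull file contents selectively later. A minimal sketch (the URL and include path are placeholders):

# Clone without downloading any LFS file contents (you get pointer files only)
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:SomeUser/some_repo.git
cd some_repo

# Later, download LFS contents only for the paths you actually need
git lfs pull --include="assets/textures"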

Asis answered 6/4, 2023 at 6:26

Comments:
Note that with modern git (both server and client must support it) the first sentence is no longer true. You can get a similar effect to using LFS with a blobless clone: you'll get a fully functional repository that's smaller than a full one and will download missing things on demand. -- Pincas

@JoachimSauer: Good point. LFS may still have an advantage for people who want to choose which files are downloaded on-demand vs not, or which files are stored on a dedicated LFS server vs on the Git server. -- Asis

Oh yeah, there may still be reasons to pick LFS, but it's no longer the only game in town. -- Pincas

I just discovered that the free version of git lfs has such strict limits that it's useless, and I'm now in the process of removing it from all my public free repos. See this answer (Repository size limits for GitHub.com) and search for the "git lfs" parts. -- Atkins

John, it seems that the only benefit of git lfs then is that it avoids downloading a ton of data all at once, right? That seems like a pretty minimal, if not useless, benefit for any repo which has a total content size (git repo + would-be git lfs repo) < 200 GB. All that using git lfs does is 1) make git checkout take forever (literally hours) (bad), 2) make my normally fast and offline git commands, like git checkout, become slow, online commands (bad), and 3) act as another GitHub service to pay for (bad). -- Atkins

I'm glad you've documented those limitations here, but I think we should be clear that they are limitations of Git LFS on GitHub and not necessarily Git LFS in general. I have never actually seen anyone using Git LFS on a free account on GitHub; maybe this is why. -- Asis

Answer:

How does git LFS track and store binary data more efficiently than git?

How does git LFS efficiently track binary files?

Summary

It doesn't. It tracks large binary files just as inefficiently; it simply stores them remotely, on a separate server, to free up some local storage space and to make the initial git clone download much less data. Here's the gist of it:

@John Zwinck:

With Git LFS, the file data are not stored in the repository, so when you clone the repository it does not have to download the complete history of the files stored in LFS. Only the "current" version of each LFS file is downloaded from the LFS server. Technically, LFS files are downloaded during "checkout" rather than "clone."

@Schwern:

  1. It can drastically reduce the size of the initial git clone of a repository.
  2. It can drastically reduce the size of the local repository.

@Mark Bramnik:

The idea is that the binaries are downloaded from the "remote" repository lazily, namely during the checkout process rather than during cloning or fetching.

Details

Regular Git repo

Imagine you have a massive mono-repo with about 100 GB of text files (code, including all git blobs and changes), and 100 GB of binary data. Note that this is a realistic, representative example I actually dealt with for a few years. If the 100 GB of binary data has been committed once, it takes up 100 GB, and your total git repo is 100 GB of code blobs + 100 GB of binary data committed once = 200 GB.

If the 100 GB of binary data has been changed 10 times for each file, however, then it takes up ~100 GB x (1 + 10) = 1.1 TB of space, + the 100 GB of code --> 1.2 TB repo size. Now, clone this repo:

# this downloads 1.2 TB of data
git clone git@github.com:MyUsername/MyRepo.github.io.git

If you want to do a git checkout, however, it's fast! All of the binary data is stored locally in your repo, since you have all 11 snapshots (the initial file + 10 changes) of the binary data!

# this downloads 0 bytes of data;
# takes **seconds**; you already have the binary data locally, so no new data is
# downloaded from the remote server
git checkout some_past_commit

# this takes seconds and downloads 0 bytes of new data as well
git checkout another_past_commit

Contrast this to git lfs:

A Git repo using Git LFS for all binary file storage

You have the same repo as above, except only the 100 GB of code is in the git repo. Git LFS has git store only tiny pointer text files that reference the binaries on the LFS server, so the git repo contains just the 100 GB of code plus a tiny bit of storage for the pointer files.

The Git LFS server, on the other hand, contains all 1.1 TB of binary files. So, you get this effect:

# this downloads 0.1 TB (100 GB) of code/text data
git clone git@github.com:MyUsername/my_repo.github.io.git
cd my_repo

# this downloads 0.1 TB (100 GB) of binary data--just the most-recent snapshot
# of all 100 GB of binary data on Git LFS
git lfs pull

# this downloads potentially up to another 0.1 TB (100 GB) of data;
# takes **hours**; you do NOT have the binary data for all snapshots stored
# locally, so at **checkout** Git LFS causes your system to download all new
# LFS data!
git checkout some_past_commit

# this downloads up to another 0.1 TB (100 GB) of data, taking **more hours**
git checkout another_past_commit

Actually, regular Git stores binary blobs more efficiently than Git LFS

See the table in @Alexander Gogl's answer here. Adding a 28.8 MB Vectorworks (.vwx) file takes 26.5 MB as a git blob, and 26.5 MB as a Git LFS blob. But, if you store it as a git blob and then run git gc to perform "garbage collection" and blob compression, regular git shrinks it to 1.8 MB. Git LFS doesn't do anything to it. See the other examples in this table too.

If you look at this table, you'll see that git overall stores more-efficiently than Git LFS:

type                      | change         | file     | as git blob | after git gc | as git-lfs blob
--------------------------|----------------|----------|-------------|--------------|----------------
Vectorworks (.vwx)        | added geometry | 28.8 MB  | +26.5 MB    | +1.8 MB      | +26.5 MB
GeoPackage (.gpkg)        | added geometry | 16.9 MB  | +3.7 MB     | +3.5 MB      | +16.9 MB
Affinity Photo (.afphoto) | toggled layers | 85.8 MB  | +85.6 MB    | +0.8 MB      | +85.6 MB
FormZ (.fmz)              | added geometry | 66.3 MB  | +66.3 MB    | +66.3 MB     | +66.3 MB
Photoshop (.psd)          | toggled layers | 25.8 MB  | +15.8 MB    | +15.4 MB     | +25.8 MB
Movie (mp4)               | trimmed        | 13.1 MB  | +13.2 MB    | +0 MB        | +13.1 MB
                          | delete a file  | -13.1 MB | +0 MB       | +0 MB        | +0 MB
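
To see the git gc effect yourself, here is a minimal sketch using standard git commands (the file name is a placeholder):

# Commit a large binary file, then check the size of git's object store
git add my_file.vwx
git commit -m "Add binary file"
git count-objects -vH    # note the size figures

# Repack and delta-compress the object store, then measure again
git gc
git count-objects -vH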

Pros and cons of Git LFS

Supposed pros of Git LFS:

  1. The initial cloning of the repo is faster, since it clones only lightweight pointers to the binary data, not the data itself.
  2. The local repository size is smaller.

Cons of Git LFS:

  1. git checkout now has to download the binary data, which might be 27 GB and take 3+ hours to finish the git checkout. And if you stop it early, you lose it all.
    1. This could happen multiple times in succession, each time you run git checkout and Git LFS needs to download more data.
  2. You have to have an active, high-speed internet connection to perform a git checkout. (In normal git, a git checkout is performed offline with no internet connection).
  3. Binary file storage actually is less efficient than regular Git (see table above).

Note: you can periodically clean your Git LFS data that isn't used for the current checkout with git lfs prune. See my answer here: How to shrink your .git folder in your git repo.
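
A quick sketch of that cleanup (both commands are part of Git LFS):

# Preview which locally-cached LFS objects would be deleted
git lfs prune --dry-run

# Delete local LFS objects not referenced by the current checkout, recent
# branches, or unpushed commits; remote LFS storage is NOT touched
git lfs prune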

When does normal git download files from the internet? A detailed look at git fetch vs git pull, and how Git LFS differs

This may not be well-understood, so I think I should add this section on how normal git works. When I use the term "download", I mean from the internet.

Regular git only downloads files from the internet when you do git clone, git fetch, or git pull. And if you are checked out on branch main, for instance, git pull is just a git fetch (an online command which downloads/"fetches" all branches from the remote server to your local PC, including downloading the remote branch main to its locally-stored, remote-tracking hidden copy called origin/main on your local PC) followed by a git merge origin/main (an offline command which does not download anything). Cloning only happens once per repo, to initially download it from the internet, so let's focus on git fetch below.

But first, let's talk about branches. For every branch you have, you actually have 3 branches. For your main branch, for instance, you have:

  1. your locally-stored non-hidden main branch,
  2. your locally-stored remote-tracking hidden branch named origin/main, which is shown when you run git branch -r, and
  3. your remote branch named main which is on the remote server named origin.
    1. To see your remotes and their URLs, run git remote -v.
    2. To see your truly-remote branch, you must open a web browser and navigate to it online on GitHub, Gitlab, or Bitbucket, for instance. Doing git fetch && git checkout origin/main, for instance, simply shows you a locally-stored, remote-tracking copy of it.
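
For instance, after cloning a repo, git branch -a lists both your local branches and your locally-stored remote-tracking branches (the output below is illustrative):

git branch -a
# * main
#   some_feature_branch
#   remotes/origin/HEAD -> origin/main
#   remotes/origin/main
#   remotes/origin/some_feature_branch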

git fetch downloads all remote branches to their hidden, locally-stored origin/branch_name counterparts, including downloading your remote main branch changes to your locally-stored, remote-tracking hidden branch named origin/main. git fetch is when remote changes are downloaded.

If you then run git checkout main followed by git merge origin/main, no new data is downloaded by either of those commands. Rather, the already-downloaded data in your locally-stored, remote-tracking hidden branch origin/main is merged into your locally-stored, non-hidden main branch when you do git merge origin/main. In regular git, git checkout is an offline command: it simply updates your local file system with files from the already-downloaded git database blobs stored locally within your .git directory.

So, let's recap and go over some more examples:

# ONLINE command: download remote server data to all of your locally-stored
# remote-tracking hidden "origin/*" branches (including `origin/main`). 
# This downloads ALL branches on the remote, not just `main`.
git fetch

# ONLINE command: download remote server data to only your locally-stored
# remote-tracking hidden "origin/main" branch. This does NOT download the
# other branches, in this case, only branch `main` from the remote server
# named `origin`.
git fetch origin main

# ONLINE command: perform an online `git fetch origin main` to update
# `origin/main`, AND fast-forward your local `main` branch to match, all
# without touching your working tree (no checkout takes place). Note: git
# refuses to update `main` this way if `main` is currently checked out
# (unless you pass `--update-head-ok`), or if the update is not a
# fast-forward.
git fetch origin main:main

# OFFLINE command: update your local file-system to match a given
# already-downloaded state
git checkout main

# OFFLINE command: merge your already-downloaded remote-tracking hidden branch,
# `origin/main`, into `main`.
git merge origin/main

# ONLINE command: perform a `git fetch origin main`, which is an online command,
# followed by `git merge origin/main`, which is an offline command. This one
# command is the equivalent of these two commands:
#
#       git fetch origin main  # ONLINE command
#       git merge origin/main  # OFFLINE command
#
git pull origin main

Contrast this with Git LFS: when using git lfs, git checkout now becomes an online command, downloading any binary files stored in git lfs from your remote server, rather than copying them from your locally-stored data in main or origin/main, for instance. That's why, in a massive repo, a few-second git checkout now becomes a several-hour git checkout. And that is why I hate Git LFS and don't recommend it. I need my git checkouts to remain offline commands which take seconds, rather than becoming online commands which take hours, so that I can get my 8 hours of work done in an 8-hour day, rather than needing a 12 to 16 hour day where half of it is wasted waiting.

I experienced the above time waste for three years in a professional company as a remote worker working in their massive (~120 GB) mono-repo, and have extensive negative experience with this. The right solution is to at least give me a larger SSD (ex: 2~8 TB, rather than 512 GB), so that I can have a script to run git lfs fetch --all to download all Git LFS data for all branches every night, overnight. But, just switching to regular git might be even better, even for storing large binary objects.

Addendum: Note also that Git LFS only downloads data from the internet if that data is not already cached locally in the .git/lfs dir. So, on a fresh repo where you are on main, git checkout A would go to the internet and download the LFS data for branch A, caching it in .git/lfs/. git checkout B would then go to the internet and download the B branch LFS data. Running git checkout A again would then retrieve the locally-cached data from .git/lfs without going to the internet again, since it's already cached.
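
To inspect that local cache, a quick sketch (both commands are standard):

# List the files in the current checkout that are tracked by Git LFS
git lfs ls-files

# See how much disk space the local LFS cache is using
du -sh .git/lfs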

For a little more info. on the .git/lfs dir, see my answer here: How to shrink your .git folder in your git repo.

To mitigate the above behavior of git checkout becoming an online command, you can have a cronjob run git lfs fetch --all periodically--perhaps once per night--if your hard drive has enough space, so that Git LFS data is pre-fetched into your local .git/lfs dir while you sleep. See my answer here: What is the difference between git lfs fetch, git lfs fetch --all, and git lfs pull?. But if you have a large enough hard drive to do that, even better: don't use Git LFS at all in the first place, since its only selling point is that it saves local hard drive space by turning real files (per your Git LFS configuration settings) into file pointers instead of downloading everything via the normal Git online commands.

Other references:

  1. For where I first learned about the 3 git branches, including the locally-stored remote-tracking hidden origin/* branches, see this answer here: How do I delete a Git branch locally and remotely?, and my several comments beneath it, starting here.

See also

  1. My question: Update (after additional learning since writing this question): don't use git lfs. I now recommend against using git lfs
    1. All of the "see also" links at the bottom of my question.
  2. My Q&A: What is the difference between git lfs fetch, git lfs fetch --all, and git lfs pull?

Atkins answered 27/6, 2023 at 18:26

Comments:
As far as I know, git checkout updates only the files in the working tree to match the version of the index you are working on, but not older or other versions, or am I mistaken? Why do you think that git checkout without enabling lfs would download fewer files? Or do you see the problem in having to download uncompressed lfs-tracked files instead of diff-compressed files? -- Jokester

@AlexanderGogl, to help explain this, I just added a whole section titled "When does normal git download files from the internet?" to my answer. Please read that. Normally, git checkout is an offline command that never downloads any files from the internet. The online git commands which download files are git clone, git fetch, and git pull. And if you are on branch main, for instance, git pull is just a git fetch origin main (an online command which downloads) followed by a git merge origin/main (an offline command which does not download). See my new section for details. -- Atkins

Thank you for the detailed clarification of the commands' underlying procedures. I wasn't aware of that! Are you saying that if I switch from branch A to B and back to A in a git lfs enabled repository, then git will first fetch the blobs of branch A from the remote, then the ones from B, and then again the ones from A? Why wouldn't it just use the lfs blobs that are already in the local repository? -- Jokester

@AlexanderGogl, Git LFS only downloads data from the internet if it is not already cached locally in the .git/lfs dir. So, on a fresh repo where you are on main, git checkout A would go to the internet and download the LFS data for A, caching it in .git/lfs/. git checkout B would then go to the internet and download the B branch LFS data. Running git checkout A again would then retrieve the locally-cached data from .git/lfs without going to the internet again, since it's already cached. -- Atkins

@AlexanderGogl, for a little more info on the .git/lfs dir, see my answer here: How to shrink your .git folder in your git repo. -- Atkins

@AlexanderGogl, I updated my answer again, adding an "Addendum" near the end. -- Atkins
