How does git LFS track and store binary data more efficiently than git?
Summary
It doesn't. It inefficiently tracks large binary files. It simply does it remotely, on a separate server, to free up some local storage space and to make the initial `git clone` process download much less data initially. Here's the gist of it:
@John Zwinck:
With Git LFS, the file data are not stored in the repository, so when you clone the repository it does not have to download the complete history of the files stored in LFS. Only the "current" version of each LFS file is downloaded from the LFS server. Technically, LFS files are downloaded during "checkout" rather than "clone."
@Schwern:
- It can drastically reduce the size of the initial git clone of a repository.
- It can drastically reduce the size of the local repository.
@Mark Bramnik:
The idea is that the binaries are downloaded from the "remote" repository lazily, namely during the checkout process rather than during cloning or fetching.
Details
Regular Git repo
Imagine you have a massive mono-repo with about 100 GB of text files (code, including all git blobs and changes), and 100 GB of binary data. Note that this is a realistic, representative example I actually dealt with for a few years. If the 100 GB of binary data has been committed once, it takes up 100 GB, and your total git repo is 100 GB of code blobs + 100 GB of binary data committed once = 200 GB.
If the 100 GB of binary data has been changed 10 times for each file, however, then it takes up ~100 GB x (1 + 10) = 1.1 TB of space, + the 100 GB of code --> 1.2 TB repo size. Now, clone this repo:
# this downloads 1.2 TB of data
git clone git@github.com:MyUsername/MyRepo.github.io.git
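(As an aside, if you want to measure this on a repo of your own, git can report the size of its object store directly. A minimal sketch, using standard git and coreutils commands, run from inside any repo:)

# print object-store statistics in human-readable units; note the "size-pack" line
git count-objects -vH
# print the total on-disk size of the .git dir
du -sh .git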
If you want to do a `git checkout`, however, it's fast! All of the binary data is stored locally in your repo, since you have all 11 snapshots (the initial file + 10 changes) of the binary data!
# this downloads 0 bytes of data;
# takes **seconds**; you already have the binary data locally, so no new data is
# downloaded from the remote server
git checkout some_past_commit
# this takes seconds and downloads 0 bytes of new data as well
git checkout another_past_commit
Contrast this to `git lfs`:
A Git repo using Git LFS for all binary file storage
You have the same repo as above, except only the 100 GB of code is in the git repo. Git LFS makes git store just tiny text pointer files in place of the binaries, so the stuff in the git repo is only the 100 GB of code + a tiny bit of storage for the pointer files.
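To make that concrete, here's roughly what one of those pointer files looks like. This is a sketch: the path is hypothetical and the oid/size values are placeholders, but the `version`/`oid`/`size` layout is the actual Git LFS pointer format:

# show the pointer file that git itself stores for an LFS-tracked file
# (hypothetical path; the oid and size below are placeholders)
git show HEAD:assets/big_file.bin
# version https://git-lfs.github.com/spec/v1
# oid sha256:4d7a2146...
# size 104857600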
The Git LFS server, on the other hand, contains all 1.1 TB of binary files. So, you get this effect:
# this downloads 0.1 TB (100 GB) of code/text data
git clone git@github.com:MyUsername/my_repo.github.io.git
# this downloads 0.1 TB (100 GB) of binary data--just the most-recent snapshot
# of all 100 GB of binary data on Git LFS
cd my_repo
git lfs pull
# this downloads potentially up to another 0.1 TB (100 GB) of data;
# takes **hours**; you do NOT have the binary data for all snapshots stored
# locally, so at **checkout** Git LFS causes your system to download all new
# LFS data!
git checkout some_past_commit
# this downloads up to another 0.1 TB (100 GB) of data, taking **more hours**
git checkout another_past_commit
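For reference, files get into Git LFS in the first place via tracking patterns stored in `.gitattributes`. A minimal setup sketch, with a hypothetical `*.bin` pattern:

# one-time: install the Git LFS hooks into this repo
git lfs install
# tell Git LFS to manage all *.bin files (hypothetical pattern)
git lfs track "*.bin"
# the pattern is recorded in .gitattributes, which must be committed
git add .gitattributes
git commit -m "Track *.bin files with Git LFS"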
Actually, regular Git stores binary blobs more efficiently than Git LFS
See the table in @Alexander Gogl's answer here. Adding a 28.8 MB Vectorworks (.vwx) file takes 26.5 MB as a git blob, and 26.5 MB as a Git LFS blob. But if you store it as a git blob and then run `git gc` to perform "garbage collection" and blob compression, regular git shrinks it to 1.8 MB. Git LFS doesn't compress it at all. See the other examples in this table too; overall, git stores data more efficiently than Git LFS:
| type | change | file size | as git blob | after `git gc` | as git-lfs blob |
|---|---|---|---|---|---|
| Vectorworks (.vwx) | added geometry | 28.8 MB | +26.5 MB | +1.8 MB | +26.5 MB |
| GeoPackage (.gpkg) | added geometry | 16.9 MB | +3.7 MB | +3.5 MB | +16.9 MB |
| Affinity Photo (.afphoto) | toggled layers | 85.8 MB | +85.6 MB | +0.8 MB | +85.6 MB |
| FormZ (.fmz) | added geometry | 66.3 MB | +66.3 MB | +66.3 MB | +66.3 MB |
| Photoshop (.psd) | toggled layers | 25.8 MB | +15.8 MB | +15.4 MB | +25.8 MB |
| Movie (mp4) | trimmed | 13.1 MB | +13.2 MB | +0 MB | +13.1 MB |
| | delete a file | -13.1 MB | +0 MB | +0 MB | +0 MB |
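If you'd like to reproduce this kind of measurement on your own repo, you can compare the object-store size before and after garbage collection. A minimal sketch:

# object-store size before compression; note the "size-pack" line
git count-objects -vH
# repack and delta-compress git's internal storage (this rewrites git's
# on-disk storage only; it does not change your history)
git gc
# object-store size after compression; compare "size-pack" to the run above
git count-objects -vH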
Pros and cons of Git LFS
Supposed pros of Git LFS:
- The initial cloning of the repo is faster, since it clones only light-weight pointers to the binary data.
- The local repository size is smaller.
Cons of Git LFS:
- `git checkout` now has to download the binary data, which might be 27 GB and take 3+ hours to finish. And if you stop it early, you lose it all.
- This could happen multiple times in succession, each time you run `git checkout` and Git LFS needs to download more data.
- You have to have an active, high-speed internet connection to perform a `git checkout`. (In normal git, a `git checkout` is performed offline, with no internet connection.)
- Binary file storage is actually less efficient than in regular Git (see the table above).
Note: you can periodically clean out Git LFS data that isn't used by the current checkout with `git lfs prune`. See my answer here: How to shrink your .git folder in your git repo.
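A safe way to use it is to preview what would be deleted first. A minimal sketch:

# preview which locally-cached LFS objects would be deleted, deleting nothing
git lfs prune --dry-run --verbose
# actually delete locally-cached LFS objects not needed by the current checkout
git lfs prune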
When does normal git download files from the internet? A detailed look at `git fetch` vs `git pull`, and how Git LFS differs
This may not be well understood, so I think I should add this section on how normal git works. When I use the term "download", I mean download from the internet.
Regular git only downloads files from the internet when you do `git clone`, `git fetch`, or `git pull`. And if you are checked out on branch `main`, for instance, `git pull` is just a `git fetch` (an online command which downloads/"fetches" all branches from the remote server to your local PC, including downloading the remote's `main` branch to its locally-stored, remote-tracking hidden copy called `origin/main` on your local PC) followed by a `git merge origin/main` (an offline command which does not download anything). Cloning is only done to initially download the repo from the internet, so that online git command only occurs once per repo. So, let's talk about `git fetch` more, below.
But first, let's talk about branches. For every branch you have, you actually have 3 branches. For your `main` branch, for instance, you have (see also the sketch after this list):
- your locally-stored, non-hidden `main` branch,
- your locally-stored, remote-tracking hidden branch named `origin/main`, which is shown when you run `git branch -r`, and
- your remote branch named `main`, which is on the remote server named `origin`.
  - To see your remotes and their URLs, run `git remote -v`.
  - To see your truly-remote branch, you must open a web browser and navigate to it online on GitHub, GitLab, or Bitbucket, for instance. Doing `git fetch && git checkout origin/main`, for instance, simply shows you a locally-stored, remote-tracking copy of it.
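To see the first two kinds on your own machine, a quick sketch:

# list your locally-stored, non-hidden branches (ex: main)
git branch
# list your locally-stored, remote-tracking hidden branches (ex: origin/main)
git branch -r
# list both at once
git branch -a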
`git fetch` downloads all remote branches to their hidden, locally-stored `origin/branch_name` counterparts, including downloading your remote `main` branch changes to your locally-stored, remote-tracking hidden branch named `origin/main`. `git fetch` is when remote changes are downloaded. If you then run `git checkout main` followed by `git merge origin/main`, no new data is downloaded by either of those commands. Rather, the already-downloaded data in your locally-stored, remote-tracking hidden branch `origin/main` is simply merged into your locally-stored, non-hidden `main` branch when you do `git merge origin/main`. In regular git, a `git checkout` is an offline command: it simply updates your local file system with files from your locally-stored, already-downloaded git database blobs within your `.git` directory.
So, let's recap and go over some more examples:
# ONLINE command: download remote server data to all of your locally-stored
# remote-tracking hidden "origin/*" branches (including `origin/main`).
# This downloads ALL branches on the remote, not just `main`.
git fetch
# ONLINE command: download remote server data to only your locally-stored
# remote-tracking hidden "origin/main" branch. This does NOT download the
# other branches, in this case, only branch `main` from the remote server
# named `origin`.
git fetch origin main
# ONLINE command: perform an online `git fetch origin main` to update
# `origin/main`, and also fast-forward your local `main` branch to match.
# Note: this updates the `main` branch ref only. It does NOT touch your
# working tree (no checkout happens), it is fast-forward-only, and git
# refuses to do it if `main` is the branch you currently have checked out.
git fetch origin main:main
# OFFLINE command: update your local file-system to match a given
# already-downloaded state
git checkout main
# OFFLINE command: merge your already-downloaded remote-tracking hidden branch,
# `origin/main`, into `main`.
git merge origin/main
# ONLINE command: perform a `git fetch origin main`, which is an online command,
# followed by `git merge origin/main`, which is an offline command. This one
# command is the equivalent of these two commands:
#
# git fetch origin main # ONLINE command
# git merge origin/main # OFFLINE command
#
git pull origin main
Contrast this with Git LFS: `git checkout`, when using `git lfs`, now becomes an online command, downloading any binary files stored in `git lfs` from your remote online server, rather than copying them from your locally-stored data in `main` or `origin/main`, for instance. And that's why, in a massive repo, a few-second `git checkout` now becomes a several-hour `git checkout`. And that is why I hate and don't recommend Git LFS. I need my `git checkout`s to remain offline commands which take seconds, rather than to become online commands which take hours, so that I can get my 8 hours of work done in an 8-hour day, rather than requiring a 12 to 16 hour day where half of that is wasted.
I experienced the above time waste for three years as a remote worker at a professional company, working in their massive (~120 GB) mono-repo, and I have extensive negative experience with this. The right solution is to at least give me a larger SSD (ex: 2 to 8 TB, rather than 512 GB), so that I can have a script run `git lfs fetch --all` every night, overnight, to download all Git LFS data for all branches. But just switching to regular git might be even better, even for storing large binary objects.
Addendum: note also that Git LFS only downloads data from the internet if it is not already cached locally in the `.git/lfs` dir. So, in a fresh clone where you are on `main`, `git checkout A` would go to the internet and download the LFS data for branch `A`, caching it in `.git/lfs/`. `git checkout B` would then go to the internet and download branch `B`'s LFS data. Running `git checkout A` again would then retrieve the locally-cached data from `.git/lfs` without going to the internet again, since it's already cached.
For a little more info on the `.git/lfs` dir, see my answer here: How to shrink your .git folder in your git repo.
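(If you're curious how much space that cache is using at any given moment, a quick check; `.git/lfs` is where Git LFS keeps its local object cache:)

# show the total on-disk size of your locally-cached Git LFS objects
du -sh .git/lfs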
To mitigate the above behavior of `git checkout` becoming an online command, you can have a cronjob run `git lfs fetch --all` periodically (perhaps once per night) if your hard drive has enough space, so that Git LFS data is pre-fetched into your local `.git/lfs` dir. See my answer here: What is the difference between `git lfs fetch`, `git lfs fetch --all`, and `git lfs pull`?. But if you have a large enough hard drive to do that, even better: don't use Git LFS at all in the first place, since its only selling point is that it tries to save local hard drive space by not downloading the entire repo via the normal git online commands, turning real files into file pointers per your Git LFS configuration settings.
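Here is a minimal sketch of such a cronjob; the repo path and the 2am schedule below are hypothetical placeholders:

# open your user crontab for editing
crontab -e
# then add a line like this, to pre-fetch all Git LFS data for all branches
# at 2am every night (the repo path is a placeholder):
#     0 2 * * * cd /home/me/my_repo && git lfs fetch --all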
Other references:
- For where I first learned about the 3 git branches, including the locally-stored, remote-tracking hidden `origin/*` branches, see this answer here: How do I delete a Git branch locally and remotely?, and my several comments beneath it, starting here.
See also
- My question, with an update after additional learning since writing it: don't use `git lfs`. I now recommend against using `git lfs`.
- All of the "see also" links at the bottom of my question.
- My Q&A: What is the difference between `git lfs fetch`, `git lfs fetch --all`, and `git lfs pull`?
From the comments (some truncated at the start):
- "…`git lfs` might solve the GitHub space limitation issue, it wouldn't solve the '`git checkout` takes forever' issue that anyone separated from the remote server (ex: all remote employees) would still see. So, I'm not talking about only GitHub's implementation. I'm talking about Git LFS in general." – Atkins
- "…`git lfs` for 3 years in a 1200-developer org in a mono repo that was around 200 GB, with 100 GB being in `git lfs`, and every single flippin' week, if not day, simply doing `git fetch` and `git checkout main`, or `git checkout my_branch_from_yesterday`, or similar, would take up to 3 hours for the checkout alone, since `git lfs` adds hooks to pull `git lfs` data when you do `git checkout`. This is because someone on the AI perception team would add a bunch of camera data or something to `git lfs`, & my checkout would download it." – Atkins
- "…`git checkout`s, than a 1 TB SSD, which is what I was allotted, with a 200 GB repo and 700 GB of build data, that takes 3 hours every day when I need to change branches to look at something (via a normally-benign `git checkout`)." – Atkins