When and how does git use deltas for storage?
Asked Answered
W

2

35

Reading git's documentation one of the things they stress a lot is that git stores snapshots and not deltas. Since I saw a course on Git saying that Git stores differences between versions of files I tried the following: I initialized a git repository on an empty folder, created a file lorem.txt containing some lorem ipsum text staged the file and commited.

Then using find .git/objects -type f on command line I listed what git saved on the objects folder and as expected found a commit object pointing to a tree object pointing to a blob object containing the lorem ispum text I saved.

Then I modified the lorem ipsum text, adding more content to it, staged this change and commited. Listing again the files, I could see now the new commit object, pointing to a new three object and to a new blob object. Using git cat-file -p 331cf0780688c73be429fa602f9dd99f18b36793 I could see the contents of the new blob. They were exactly the contents of the full lorem.txt file, the old contents plus the change.

This works as expected by the documentation: git stores snapshots, not deltas. However, searching on the internet I found this SO question. On th accepted answer we see the following:

While that's true and important on the conceptual level, it is NOT true at the storage level.

Git does use deltas for storage.

Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta-compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. This way it can (often, depends on the heuristics) take advantage of other similar files or older versions that are more similar than the previous. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses value 250, which makes it run very slow, but provide extra compression for history data.

Which says that Git does use deltas for storage. As I understand from this, Git doesn't use deltas all the time, but only when it detects it is necessary. Is this true?

I placed a lot of lorem text on the file, so that it's 2mb in size. I thought that when making a small change to a big text file Git would automatically use deltas, but as I said it didn't.

When Git use deltas and how this works out?

Wert answered 29/1, 2015 at 19:13 Comment(4)
see git gc or git repackMarin
Git uses delta compression to efficiently store the objects when it creates pack files. This is an implementation detail about the compression algorithm used -- it's completely immaterial to using git.Ease
See also git-scm.com/book/en/v1/Git-Internals-PackfilesEase
Its object-storage layer does delta compression when its heuristics say the payoff for heavy-duty compression will be gratifying or when you explicitly tell it to.Rachmaninoff
P
27

Git only uses deltas in "packfiles". Initially, each git object is written as a separate file (as you found). Later, git can pack many objects into one file, called a "pack file". The pack file is then compressed, which automatically exploits any repetitions between the files in the packfile (or repetitions inside files).

This packing is performed by git repack. You can see it in action by invoking it manually. If you run git repack -ad on a git repo, you should see used disk space and number of files under .git/objects drop, as files are combined into packs and compressed.

In practice, you don't usually need to run git repack. Git by default regularly runs git gc, which in turn runs git repack when necessary. So relax, git has your back :-).

The excellent "git book" also has a chapter on packfiles with more explanations: http://git-scm.com/book/en/v2/Git-Internals-Packfiles .

Payoff answered 29/1, 2015 at 22:45 Comment(0)
G
6

Git 2.18 (Q2 2018) documents delta usage in Documentation/technical/pack-format

See commit 011b648 (11 May 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit b577198, 23 May 2018)

pack-format.txt: more details on pack file format

The current document mentions OBJ_* constants without their actual values. A git developer would know these are from cache.h but that's not very friendly to a person who wants to read this file to implement a pack file parser.

Similarly, the deltified representation is not documented at all (the "document" is basically patch-delta.c). Translate that C code to English with a bit more about what ofs-delta and ref-delta mean.

So the documentation now states:

Object types

Valid object types are:

  • OBJ_COMMIT (1)
  • OBJ_TREE (2)
  • OBJ_BLOB (3)
  • OBJ_TAG (4)
  • OBJ_OFS_DELTA (6)
  • OBJ_REF_DELTA (7)

Type 5 is reserved for future expansion. Type 0 is invalid.

Deltified representation

Conceptually there are only four object types: commit, tree, tag and blob.
However to save space, an object could be stored as a "delta" of another "base" object.
These representations are assigned new types ofs-delta and ref-delta, which is only valid in a pack file.

Both ofs-delta and ref-delta store the "delta" to be applied to another object (called 'base object') to reconstruct the object.
The difference between them is,

  • ref-delta directly encodes 20-byte base object name.
    • If the base object is in the same pack, ofs-delta encodes the offset of the base object in the pack instead.

The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the so-called "thin pack"). When stored on disk however, the pack should be self contained to avoid cyclic dependency.

The delta data is a sequence of instructions to reconstruct an object from the base object.
If the base object is deltified, it must be converted to canonical form first. Each instruction appends more and more data to the target object until it's complete.
There are two supported instructions so far:

  • one for copy a byte range from the source object and
  • one for inserting new data embedded in the instruction itself.

Each instruction has variable length. Instruction type is determined by the seventh bit of the first octet. The following diagrams follow the convention in RFC 1951 (Deflate compressed data format).


With Git 2.20 (Q4 2018), malformed or crafted data in packstream can make our code attempt to read or write past the allocated buffer and abort, instead of reporting an error, which has been fixed.

t5303: use printf to generate delta bases

The exact byte count of the delta base file is important.
The test-delta helper will feed it to patch_delta(), which will barf if it doesn't match the size byte given in the delta.
Using "echo" may end up with unexpected line endings on some platforms (e.g,. "\r\n" instead of just "\n").

This actually wouldn't cause the test to fail (since we already expect test-delta to complain about these bogus deltas), but would mean that we're not exercising the code we think we are.

Let's use printf instead (which we already trust to give us byte-perfect output when we generate the deltas).


With Git 2.25 (Q1 2020), in a repository with many packfiles, the cost of the procedure that avoids registering the same packfile twice was unnecessarily high by using an inefficient search algorithm, which has been corrected.

See commit ec48540 (27 Nov 2019) by Colin Stolley (ccstolley).
(Merged by Junio C Hamano -- gitster -- in commit 6d831b8, 16 Dec 2019)

packfile.c: speed up loading lots of packfiles

Signed-off-by: Colin Stolley
Helped-by: Jeff King

When loading packfiles on start-up, we traverse the internal packfile list once per file to avoid reloading packfiles that have already been loaded. This check runs in quadratic time, so for poorly maintained repos with a large number of packfiles, it can be pretty slow.

Add a hashmap containing the packfile names as we load them so that the average runtime cost of checking for already-loaded packs becomes constant.

Add a perf test to p5303 to show speed-up.

The existing p5303 test runtimes are dominated by other factors and do not show an appreciable speed-up.
The new test in p5303 clearly exposes a speed-up in bad cases.
In this test we create 10,000 packfiles and measure the start-up time of git rev-parse, which does little else besides load in the packs.

Here are the numbers for the new p5303 test:

Test                         HEAD^             HEAD
---------------------------------------------------------------------
5303.12: load 10,000 packs   1.03(0.92+0.10)   0.12(0.02+0.09) -88.3%

[jc: squashed the change to call hashmap in install_packed_git() by peff]
Signed-off-by: Junio C Hamano

Goop answered 23/5, 2018 at 20:3 Comment(2)
Am I right saying that ref to an object outside the pack is used when server sends the pack to the client during git fetch? The client needs then to tell the server beforehand the tip (hash) of the branch it wants to fetch.Springy
@Springy Regarding what the client tells to the server what it wants to fetch, see: https://mcmap.net/q/12455/-git-fetch-for-many-files-is-slow-against-a-high-latency-diskGoop

© 2022 - 2024 — McMap. All rights reserved.