Is the git binary diff algorithm (delta storage) standardized?
Asked Answered
F

3

62

Git uses a delta compression to store objects that are similar to each-other.

Is this algorithm standardized and used in other tools as well? Is there documentation describing the format? Is it compatible with xdelta/VCDIFF/RFC 3284?

Forestall answered 28/2, 2012 at 7:30 Comment(0)
H
72

I think the diff algo used for pack files was linked to one of the delta encoding out there: initially (2005) xdelta, and then libXDiff.
But then, as detailed below, it shifted to a custom implementation.

Anyway, as mentioned here:

Git does deltification only in packfiles.
But when you push via SSH git would generate a pack file with commits the other side doesn't have, and those packs are thin packs, so they also have deltas... but the remote side then adds bases to those thin packs making them standalone.

(note: creating many packfiles, or retrieving information in huge packfile is costly, and explain why git doesn't handle well huge files or huge repo.
See more at "git with large files")

This thread also reminds us:

Actually packfiles and deltification (LibXDiff, not xdelta) was, from what I remember and understand, originally because of network bandwidth (which is much more costly than disk space), and I/O performance of using single mmapped file instead of very large number of loose objects.

LibXDiff is mentioned in this 2008 thread.

However, since then, the algo has evolved, probably in a custom one, as this 2011 thread illustrates, and as the header of diff-delta.c points out:

So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all.
However the basic algorithm behind both implementations is the same
.
Studying the libxdiff version is probably easier in order to gain an understanding of how this works.

/*
 * diff-delta.c: generate a delta between two buffers
 *
 * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
 * http://www.xmailserver.org/xdiff-lib.html
 *
 * Rewritten for GIT by Nicolas Pitre <[email protected]>, (C) 2005-2007
 */

More on the packfiles the Git Book:

packfile format


Git 2.18 adds to the delta description in this new documentation section, which now (Q2 2018) states:

Object types

Valid object types are:

  • OBJ_COMMIT (1)
  • OBJ_TREE (2)
  • OBJ_BLOB (3)
  • OBJ_TAG (4)
  • OBJ_OFS_DELTA (6)
  • OBJ_REF_DELTA (7)

Type 5 is reserved for future expansion. Type 0 is invalid.

Deltified representation

Conceptually there are only four object types: commit, tree, tag and blob.
However to save space, an object could be stored as a "delta" of another "base" object.
These representations are assigned new types ofs-delta and ref-delta, which is only valid in a pack file.

Both ofs-delta and ref-delta store the "delta" to be applied to another object (called 'base object') to reconstruct the object.
The difference between them is,

  • ref-delta directly encodes 20-byte base object name.
    • If the base object is in the same pack, ofs-delta encodes the offset of the base object in the pack instead.

The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the so-called "thin pack"). When stored on disk however, the pack should be self contained to avoid cyclic dependency.

The delta data is a sequence of instructions to reconstruct an object from the base object.
If the base object is deltified, it must be converted to canonical form first. Each instruction appends more and more data to the target object until it's complete.
There are two supported instructions so far:

  • one for copy a byte range from the source object and
  • one for inserting new data embedded in the instruction itself.

Each instruction has variable length. Instruction type is determined by the seventh bit of the first octet. The following diagrams follow the convention in RFC 1951 (Deflate compressed data format).

Hydrometallurgy answered 28/2, 2012 at 8:25 Comment(5)
The final algo might be a custom one, when I read a 2011 thread like git.661346.n2.nabble.com/diff-ing-files-td6446460.htmlHydrometallurgy
In 2008, libXDiff was apparently used: git.661346.n2.nabble.com/…Hydrometallurgy
That 2011 thread is a good link. Choice quote: "So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all. However the basic algorithm behind both implementations is the same."Forestall
I wonder if they just changed to compression algorithm from libxdiff to custom, or if the on-disk format also changed (which sounds like trouble).Forestall
@Thilo: I have included the 2011 thread in the answer for more visibility.Hydrometallurgy
L
25

Git delta encoding is copy/insert based.

This means that the derived file is encoded as a sequence of opcodes which can represent copy instructions(eg: copy from the base file y bytes starting from offset x into the target buffer) or insert instructions(eg: insert the next x bytes into the target buffer).

As a very simple example(taken from the paper 'File System Support for Delta Compression'), consider that we want to create a delta buffer to transform the text "proxy  cache" into "cache  proxy". The resulting instructions should be:

  1. Copy 5 bytes from offset 7 (copy 'cache' from base buffer)
  2. Insert two spaces
  3. Copy 5 bytes from offset 0 (copy 'proxy' from base buffer)

Which translated to git's encoding becomes:

(bytes 1-3 represent the first instruction)

  • 0x91 (10010001), which is split into
    • 0x80 (10000000)(most significant bit set makes this a 'copy from base to output' instruction)
    • 0x01 (00000001)(means 'advance one byte and use it as the base offset)
    • 0x10 (00010000)(advance one byte and use it as length)
  • 0x07 (offset)
  • 0x05 (length)

(bytes 4-6 represent the second instruction)

  • 0x02 (since the MSB is not set, this means 'insert the next two bytes into the output')
  • 0x20 (space)
  • 0x20 (space)

(bytes 7-8 represent the last instruction)

  • 0x90 (10010000), which is split into
    • 0x80 (10000000)(means 'copy')
    • 0x10 (00010000)(advance one byte and use it as length)
  • 0x05 (length)

Notice that in the last copy instruction does not specify an offset which means offset 0. Other bits in the copy opcode can also be set when bigger offsets/lengths are needed.

The result delta buffer has in this example has 8 bytes, which is not much of a compression since the target buffer has 12 bytes, but when this encoding applied to large text files it can make a huge difference.

I have recently pushed a node.js library to github which implements both diff/patch functions using git delta encoding. The code should be more readable and commented than the one in git source, which is heavily optimized.

I also have written a few tests that explain the output opcodes used in each example with a format similar to the above.

Laminar answered 13/1, 2013 at 13:30 Comment(1)
The following article also contains some useful information: stefan.saasen.me/articles/…Rooky
R
5

Is this algorithm standardized and used in other tools as well?

The pack format is part of a public API: the transfer protocols used for push and fetch operations use it to send less data over the network.

They are implemented in at least two other major Git implementations besides the reference: JGit and libgit2.

Therefore, it is very unlikely that there will be backwards incompatible changes to the format, and can be considered to be "standardized" in that sense.

This amazing file from the docs describes the heuristics used in the pack algorithm as a funny commentary on an email by Linus: https://github.com/git/git/blob/v2.9.1/Documentation/technical/pack-heuristics.txt

Rooky answered 13/12, 2014 at 18:27 Comment(2)
Good point (and more current than my "historical" answer). +1Hydrometallurgy
@Hydrometallurgy Thanks! This question is quite open ended, and your answer and Thiago's both contain useful insights as well. It makes me happy just to have an answer next to that of other great programmers like you guys. :)Rooky

© 2022 - 2024 — McMap. All rights reserved.