How does Git compute file hashes?
The SHA1 hashes stored in the tree objects (as returned by git ls-tree) do not match the SHA1 hashes of the file content (as returned by sha1sum):

$ git cat-file blob 4716ca912495c805b94a88ef6dc3fb4aff46bf3c | sha1sum
de20247992af0f949ae8df4fa9a37e4a03d7063e  -

How does Git compute file hashes? Does it compress the content before computing the hash?

Isley answered 29/8, 2011 at 1:37 Comment(4)
See assigning Git SHA1's without Git – Sudan
For more details, also see progit.org/book/ch9-2.html – Isley
netvope's link seems to be dead now. I think this is the new location: git-scm.com/book/en/Git-Internals-Git-Objects which is §9.2 from git-scm.com/book – Maddi
Related: What is the file format of a git commit object? – Fibula
C
141

Git prefixes the object with "blob ", followed by the length of the content in bytes (as a human-readable decimal integer), followed by a NUL character, and hashes the result:

$ echo -e 'blob 14\0Hello, World!' | shasum
8ab686eafeb1f44702738c8b0f24f2567c36da6d

Source: http://alblue.bandlem.com/2011/08/git-tip-of-week-objects.html
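
The length in the header counts only the bytes of the content, not the header or the NUL. The newline that echo appends is part of the content, which is why the example uses 14 rather than 13. A minimal Python sketch of the same computation (the function name blob_hash is only illustrative):

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """SHA-1 of a Git blob: 'blob <byte length>' + NUL, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# `echo 'Hello, World!'` emits 13 characters plus a newline: 14 bytes.
content = b"Hello, World!\n"
print(len(content))        # 14
print(blob_hash(content))  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
```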

Claypoole answered 29/8, 2011 at 1:42 Comment(9)
Also worth mentioning that it replaces "\r\n" with "\n", but leaves isolated "\r"s alone. – Au
^correction to above comment: sometimes git does the replacement above, depending on one's eol/autocrlf settings. – Au
You can also compare this to the output of echo 'Hello, World!' | git hash-object --stdin. Optionally you can specify --no-filters to make sure no crlf conversion happens, or specify --path=somethi.ng to let git use the filter specified via gitattributes (also @user420667). And -w to actually submit the blob to .git/objects (if you are in a git repo). – Modulator
Expressing the equivalence, to make sense: echo -e 'blob 16\0Hello, \r\nWorld!' | shasum == echo -e 'Hello, \r\nWorld!' | git hash-object --stdin --no-filters and it will also be equivalent with \n and 15. – Sievert
Shouldn't the length be 13 and not 14? – Emlynne
@amn the nul character \0 gets counted as a character. – Claypoole
That's strange -- I've digested the text blob <length>\0 where <length> is file size in bytes, followed by the file's contents, and the result matches what git hash-object --no-filters <file-path> gives me. When I count the null byte as an additional byte for <length> the hashes are obviously no longer equal. – Emlynne
echo appends a newline to the output, which is also passed into git. That's why it's 14 characters. To use echo without a newline, write echo -n 'Hello, World!' – Danette
Wouldn't this cause two files with the same contents to have the same hash (because the string being hashed only varies with the contents and length of the file)? – Hematite
H
37

I am only expanding on the answer by @Leif Gruenwoldt and detailing what is in the reference it provides.

Do It Yourself

  • Step 1. Create an empty text document (name does not matter) in your repository
  • Step 2. Stage and Commit the document
  • Step 3. Identify the hash of the blob by executing git ls-tree HEAD
  • Step 4. Find the blob's hash to be e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
  • Step 5. Snap out of your surprise and read below

How does Git compute its blob hashes?

    Blob hash (SHA1) = SHA1("blob " + <size_of_file> + "\0" + <contents_of_file>)

The text "blob " (with a trailing space) is a constant prefix, and \0 is the constant NUL character. <size_of_file> is the file size in bytes, written as a decimal integer with no padding, and <contents_of_file> is the raw file content; both vary from file to file.

See: What is the file format of a git commit object?

And that's all, folks!

But wait! Did you notice that the <filename> is not a parameter used in the hash computation? Two files have the same hash whenever their contents are the same, regardless of their names or the date and time they were created. This is one of the reasons Git handles moves and renames better than other version control systems.

Do It Yourself (Ext)

  • Step 6. Create another empty file with a different filename in the same directory
  • Step 7. Compare the hashes of both your files.
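
Steps 6 and 7 can also be tried without a repository. The sketch below is a rough illustration, assuming only the blob formula given earlier; file_blob_hash and the two file names are made up for the example. It hashes two empty files with different names and shows that both yield the hash from Step 4.

```python
import hashlib
import os
import tempfile


def file_blob_hash(path: str) -> str:
    """Hash a file the way Git hashes a blob; the name and timestamps play no part."""
    digest = hashlib.sha1(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as f:
        # Feed the content in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    for path in (first, second):
        open(path, "wb").close()  # create two empty files with different names
    hashes = [file_blob_hash(first), file_blob_hash(second)]

print(hashes[0] == hashes[1])  # True
print(hashes[0])               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```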

Note:

The link does not mention how the tree object is hashed. I am not certain of the algorithm and parameters; however, from my observation, it probably computes a hash based on all the blobs and trees (probably their hashes) that it contains.
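
For readers who want to experiment, here is a hedged sketch of tree hashing as the Git internals documentation appears to describe it. It is not taken from the link above. The assumptions are that a tree uses the same "<type> <size>\0" header as a blob, that each entry is serialized as "<mode> <name>\0" followed by the 20 raw bytes of the entry's SHA-1, and that entries are sorted by name with directories compared as if their name ended in "/". The function name tree_hash and the tuple layout are illustrative only.

```python
import hashlib


def tree_hash(entries):
    """entries: iterable of (mode, name, hex_sha1), e.g. ("100644", "a.txt", "e69de2...")."""
    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: directories (mode 40000) sort as if their name ended in "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=sort_key)
    )
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


print(tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

If the assumptions hold, the result for a real directory should match what git ls-tree and git cat-file -p report for the corresponding tree, which is an easy way to check the sketch against an actual repository.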

Herrod answered 5/3, 2015 at 15:35 Comment(2)
SHA1("blob" + <size_of_file> - is there additional space character between blob and size? Is size decimal? Is it zero-prefixed? – Montpellier
@Montpellier There is. The reference and my testing confirm it. I've corrected the answer. Size seems to be number of bytes as integer with no prefix. – Outhaul
T
19

git hash-object

This is a quick way to verify your test method:

s='abc'
printf "$s" | git hash-object --stdin
printf "blob $(printf "$s" | wc -c)\0$s" | sha1sum

Output:

f2ba8f84ab5c1bce84a7b441cb1959cfc7093b7f
f2ba8f84ab5c1bce84a7b441cb1959cfc7093b7f  -

where sha1sum is in GNU Coreutils.

Then it comes down to understanding the format of each object type. We have already covered the trivial blob, here are the others:

Tremble answered 16/5, 2016 at 22:53 Comment(3)
As mentioned in a previous answer, the length should rather be calculated as $(printf "\0$s" | wc -c). Note the added empty character. That is, if the string is 'abc' with the added empty character in front the length would yield 4, not 3. Then the results with sha1sum matches git hash-object. – Petunia
You're right they do match. It seems that there's a bit of a pernicious side effect from using printf rather than echo -e here. When you apply git hash-object to a file containing the string 'abc' you get 8baef1b...f903 which is what you get when using echo -e rather than printf. Provided that echo -e adds a newline at the end of a string it seems that to match the behavior with printf you can do the same (i.e. s="$s\n"). – Petunia
upvote for using printf rather than echo -e – Chihuahua
O
4

I needed this for some unit tests in Python 3 so thought I'd leave it here.

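import hashlib

def git_blob_hash(data):
    """Return the SHA-1 that Git assigns to a blob with this content."""
    if isinstance(data, str):
        data = data.encode()  # a str is hashed as its UTF-8 bytes
    # The header is "blob <length in bytes>\0", followed by the raw content.
    data = b'blob ' + str(len(data)).encode() + b'\0' + data
    h = hashlib.sha1()
    h.update(data)
    return h.hexdigest()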

I stick to \n line endings everywhere, but in some circumstances Git might also change your line endings before calculating this hash, so you may need a .replace('\r\n', '\n') on the text in there too.
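
As a rough illustration of that caveat, the sketch below converts CRLF to LF before hashing. It is only a stand-in for what Git may do under eol/autocrlf settings, not a reimplementation of Git's conversion rules, and the name git_blob_hash_normalized and its normalize_eol parameter are invented for the example. Note that the length in the header has to be taken after the conversion.

```python
import hashlib


def git_blob_hash_normalized(data: bytes, normalize_eol: bool = False) -> str:
    """Blob hash, optionally converting CRLF to LF first (a rough stand-in for autocrlf)."""
    if normalize_eol:
        data = data.replace(b"\r\n", b"\n")  # isolated "\r" bytes are left alone
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"  # length after conversion
    return hashlib.sha1(header + data).hexdigest()


crlf = b"Hello, World!\r\n"
print(git_blob_hash_normalized(crlf, normalize_eol=True))
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d, the same as for the LF version
print(git_blob_hash_normalized(crlf) == git_blob_hash_normalized(b"Hello, World!\n"))
# False: without conversion the extra "\r" byte changes both the length and the hash
```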

Outhaul answered 14/5, 2017 at 12:30 Comment(0)
B
3

Based on Leif Gruenwoldt's answer, here is a shell function that can substitute for git hash-object:

git-hash-object () { # substitute when the `git` command is not available
    local type=blob
    [ "$1" = "-t" ] && shift && type=$1 && shift
    # depending on eol/autocrlf settings, you may want to substitute CRLFs by LFs
    # by using `perl -pe 's/\r$//'` instead of `cat` in the next 2 commands
    # tr strips the padding some `wc` implementations put around the count
    local size=$(cat "$1" | wc -c | tr -d ' ')
    ( printf '%s %s\0' "$type" "$size"; cat "$1" ) | sha1sum | sed 's/ .*$//'
}

Test:

$ echo 'Hello, World!' > test.txt
$ git hash-object test.txt
8ab686eafeb1f44702738c8b0f24f2567c36da6d
$ git-hash-object test.txt
8ab686eafeb1f44702738c8b0f24f2567c36da6d
Bemoan answered 27/6, 2016 at 15:5 Comment(0)
T
0

This is a Python 3 version that hashes the file as raw bytes, so it also works for binary files (the example above is for text).

Note that the code is a snippet, not a complete script; for readability, put it in your own def. It assumes that fullFile (the download URL), targetFile (the local path) and sha (the expected blob hash) are already defined, and it downloads the file only when the local copy is missing or its blob hash differs from sha.

import hashlib
import os

import requests

# fullFile (download URL), targetFile (local path) and sha (expected blob
# hash) come from the surrounding code, which is not shown here.
upToDate = False
if os.path.exists(targetFile):
    targetSize = os.path.getsize(targetFile)
    # Git hashes "blob <size>\0" followed by the raw file bytes.
    header = f"blob {targetSize}\0".encode('utf-8')
    with open(targetFile, 'rb') as oldfile:
        sha1Hash = hashlib.sha1(header + oldfile.read()).hexdigest()
    upToDate = (sha == sha1Hash)
if not upToDate:
    # Mode 'wb' creates the file or truncates a stale copy before writing.
    with requests.get(fullFile) as response2, open(targetFile, 'wb') as newfile:
        newfile.write(response2.content)
Tortile answered 10/7, 2023 at 9:25 Comment(0)
I
0

Git 2.45 (Q2 2024), batch 10, now offers official documentation on this.

See commit 28636d7 (12 Mar 2024) by Dirk Gouders (dgouders-whs).
(Merged by Junio C Hamano -- gitster -- in commit 509a047, 21 Mar 2024)

Documentation/user-manual.txt: example for generating object hashes

Signed-off-by: Dirk Gouders

Add a simple example on how object hashes can be generated manually.

Further, because the document suggests to have a look at the initial commit, clarify that some details changed since that time.

user-manual now includes in its man page:

for 'file' (the earliest versions of Git hashed slightly differently but the conclusion is still the same).

The following is a short example that demonstrates how these hashes can be generated manually:

Let's assume a small text file with some simple content:

$ echo "Hello world" >hello.txt

We can now manually generate the hash Git would use for this file:

  • The object we want the hash for is of type "blob" and its size is 12 bytes.

  • Prepend the object header to the file content and feed this to sha1sum:

$ { printf "blob 12\0"; cat hello.txt; } | sha1sum
802992c4220de19a90767f3000a79a31b98d0df7  -

That manually constructed hash can be verified using git hash-object which of course hides the addition of the header:

$ git hash-object hello.txt
802992c4220de19a90767f3000a79a31b98d0df7
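
The same check can be sketched in Python, where the size need not be hard-coded because it is taken from the content itself. The helper name object_hash is hypothetical, and the type is passed as a parameter only because the documentation speaks of the object's type; the example exercises the "blob" case shown above and nothing else.

```python
import hashlib


def object_hash(obj_type: str, payload: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + payload, the header scheme shown above."""
    header = f"{obj_type} {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


# Same input as `echo "Hello world" >hello.txt`: 11 characters plus a newline.
print(object_hash("blob", b"Hello world\n"))
# 802992c4220de19a90767f3000a79a31b98d0df7
```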
Isocrates answered 1/4, 2024 at 3:43 Comment(0)
