Git objects SHA-1 are file contents or file names?
I am confused about how a file's actual contents are stored in .git.

For example, Version 1 is the actual text content of test.txt. When I commit it to the repo (first commit), Git gives me a SHA-1 for that file, and the object is located at .git\objects\0c\15af113a95643d7c244332b0e0b287184cd049.

When I open the file 15af113a95643d7c244332b0e0b287184cd049 in a text editor, it's all garbage, something like this:

x+)JMU074f040031QÐKÏ,ÉLÏË/Je¨}ºõw[Éœ„ÇR­ ñ·Î}úyGª*±8#³¨,1%>9?¯$5¯D¯¤¢„áôÏ3%³þú>š~}Ž÷*ë²-¶ç¡êÊòR“KâKòãs+‹sô

But I'm not sure whether this garbage is an encrypted form of the text Version 1, or whether the text is instead represented by the SHA-1 15af113a95643d7c244332b0e0b287184cd049.

Bantling answered 10/6, 2017 at 17:2 Comment(4)
git-scm.com/book/en/v2/Git-Internals-Git-Objects - Aguiar
That's what I am going through now, but I am still not clear on the file contents part, so I had to ask here. - Bantling
Can you edit your question to be about some specific aspect of the description under the "Object Storage" header at that link that's hard to understand? - Pyrite
Explore your repository objects with the command git cat-file -p [sha1] and you will understand better... - Lang

The correct answer to the question in the subject line:

Git objects SHA-1 are file contents or file names?

is probably "neither", since you were referring to the contents of the loose object file, rather than the original file—and even if you were referring to the original file, that's still not quite right.

A loose object, in Git, is a plain file. The name of the file is constructed from the object's hash ID. The object's hash ID, in turn, is constructed by computing a hash of the object's contents with a prefix header attached.

The prefixed header depends on the object type. There are four types: blob, commit, tag, and tree. The header is a zero-terminated byte string: the type name as an ASCII (or equivalently, UTF-8) byte string, followed by a space, followed by the decimal representation of the object's size in bytes, followed by an ASCII NUL (b'\x00' in Python, if you prefer modern Python notation, or '\0' if you prefer C).

After the header come the actual object contents. So, for a file containing the byte string b'hello\n', the data to be hashed consist of b'blob 6\0hello\n':

$ echo 'hello' | git hash-object -t blob --stdin
ce013625030ba8dba906f756967f9e9ca394464a
$ python3
[...]
>>> import hashlib
>>> s = b'blob 6\0hello\n'
>>> hashlib.sha1(s).hexdigest()
'ce013625030ba8dba906f756967f9e9ca394464a'

Hence, the file name that would be used to store this file is (derived from) ce013625030ba8dba906f756967f9e9ca394464a. As a loose object, it becomes .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a.
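The same header-plus-contents rule can be wrapped in two small helpers. This is only a sketch: the names git_object_id and loose_object_path are made up for illustration, and it assumes a SHA-1 repository (not one created with the SHA-256 object format).

```python
import hashlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Hash '<type> <size>\\0' + data, as described above (SHA-1 repos only)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex digits name the directory, the remaining 38 the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


oid = git_object_id(b"hello\n")
print(oid)                     # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(oid))  # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

The printed path is the loose-object file for b'hello\n', .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a.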

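The contents of that file, however, are the zlib-compressed form of b'blob 6\0hello\n', apparently at level 1. Python's default level is currently 6, and the result does not match at that level. Git writes loose objects at zlib level 1 (best speed) unless core.compression or core.looseCompression is set, which would explain the match. It's not clear whether Git's zlib deflate always matches Python's byte for byte, but using level 1 did work here: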

$ echo 'hello' | git hash-object -w -t blob --stdin
ce013625030ba8dba906f756967f9e9ca394464a
$ vis .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\000\^]\M-E\^D\^T$

(note that the final $ is the shell prompt again; now back to Python3)

>>> import zlib
>>> zlib.compress(s, 1)
b'x\x01K\xca\xc9OR0c\xc8H\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14'
>>> import vis
>>> print(vis.vis(zlib.compress(s, 1)))
x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\^@\^]\M-E\^D\^T

where vis.py is:

def vischr(byte):
    "encode characters the way vis(1) does by default"
    if byte in b' \t\n':
        return chr(byte)
    # control chars: \^X; del: \^?
    if byte < 32 or byte == 127:
        return r'\^' + chr(byte ^ 64)
    # printable characters, 32..126
    if byte < 128:
        return chr(byte)
    # meta characters: prefix with \M^ or \M-
    byte -= 128
    if byte < 32 or byte == 127:
        return r'\M^' + chr(byte ^ 64)
    return r'\M-' + chr(byte)

def vis(bytestr):
    "same as vis(1)"
    return ''.join(vischr(c) for c in bytestr)

(vis produces an invertible but printable encoding of binary files; it was my 1993-ish answer to problems with cat -v).

Note that the names of files stored in a Git repository (under a commit) appear only as path name components stored in individual tree objects. Computing the hash ID of a tree object is nontrivial; I have Python code that does this in my public "scripts" repository under githash.py.
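To give a rough idea of what goes into a tree hash, here is a simplified sketch. It is not the githash.py code mentioned above; the function name tree_object_id and the (mode, name, hex ID) tuple layout are assumptions made for this example, and it assumes SHA-1 object IDs and UTF-8 file names. Each entry is the mode in ASCII octal, a space, the name, a NUL, and then the 20 raw bytes of the entry's object ID; entries are sorted by name, with directories compared as if their names ended in "/".

```python
import hashlib


def tree_object_id(entries):
    """entries: iterable of (mode, name, hex_object_id), e.g. ("100644", "pom.xml", "a703...")."""

    def sort_key(entry):
        mode, name, _ = entry
        # Directories (mode "40000") sort as if the name had a trailing "/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # raw 20-byte ID, not the hex text

    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The well-known empty tree: no entries at all.
print(tree_object_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because each entry embeds the hash of the blob or subtree it names, a real work-tree has to be hashed bottom-up, which is where most of the extra work comes from.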

Inclined answered 10/6, 2017 at 19:36 Comment(7)
At least this time, it is more "concise" and to the point. +1 - Bonfire
Thanks for a good illustrative answer, but can you please elaborate on this: "Note that the names of files stored in a Git repository (under a commit) appear only as path name components stored in individual tree objects." - Bantling
A commit identifies a tree object (try git cat-file -p HEAD to see a commit and its tree object). The tree object itself is stored in binary format (see the githash.py code I linked) but you can convert it to printable text with another git cat-file -p command. If the tree has object ID 1234567, git cat-file -p 1234567 will show it, for instance. - Inclined
@torek, as per your answer, I tried this. $ git cat-file -p 16ed output: 100644 blob a7036e7253f5e7099e8b68c2fe99ecf5f8b013d3 pom.xml . And then, cat pom.xml | git hash-object -t blob output: af86ccbdb1f4f10685ffe85cf68372109694e49a. pom.xml has only one revision (not sure how to express this in git terms), so there is no chance that I am referring to a different version of pom.xml while using git-hash-object. Why is the SHA-1 not matching? - Carillo
@samshers: Do you have filters (clean and smudge filters) and/or crlf hacking turned on? If so, use <filter> pom.xml | git hash-object -t blob --stdin because the tree and index hashes are from the filtered contents, not the work-tree contents. (Insert the appropriate command, whatever that may be, as the filter; e.g., if the work-tree copy has CRLF endings, you could use tr -d '\015' < pom.xml as the filter.) - Inclined
@Inclined - impeccable. $ cat pom.xml | tr -d '\015' | git hash-object -t blob --stdin outputs a7036e7253f5e7099e8b68c2fe99ecf5f8b013d3 . But another thing I would like to clarify: so the SHA-1 is computed from uncompressed content, but what's actually in the file (in the object database) is mostly compressed content. Right? - Carillo
@samshers: yes, the hash is over the uncompressed content (including blob <size>\0 bytes). The actual in-Git data are either a loose object (zlib compressed) or a packed object (possibly deltified). (oops, I forgot that this answer had the Python example in it!) - Inclined

Git Magic mentions:

By the way, the files within .git/objects are compressed with zlib so you should not stare at them directly. Filter them through zpipe -d, or type (using git cat-file):

$ git cat-file -p 0c15af113a95643d7c244332b0e0b287184cd049

With zpipe:

$ ./zpipe -d < .git/objects/0c/15af113a95643d7c244332b0e0b287184cd049

Note: for zpipe, I had to compile zpipe.c first:

sudo apt-get install zlib1g-dev
cd /usr/share/doc/zlib1g-dev/examples
sudo gunzip zpipe.c.gz
sudo gcc -o zpipe zpipe.c -lz

Then:

$ /usr/share/doc/zlib1g-dev/examples/zpipe -d < .git/objects/0c/15af113a95643d7c244332b0e0b287184cd049

You will get a result like:

vonc@VONCAVN7:/mnt/d/git/seec$ /usr/share/doc/zlib1g-dev/examples/zpipe -d < .git/objects/0d/b6225927ef60e21138a9762c41ea0db714ca0d
blob 2142 <full content there...>

You see a header composed of the type and content size, followed by the actual content.
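If compiling zpipe is inconvenient, Python's zlib module can do the same decompression. The following is a minimal sketch, with parse_loose_object being a name invented for this example. To stay self-contained it builds the compressed bytes itself rather than reading a real file; for an actual object you would pass in the bytes read from the file under .git/objects opened in binary mode.

```python
import zlib


def parse_loose_object(raw: bytes):
    """Decompress loose-object bytes and split the header from the contents."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")   # header ends at the first NUL
    obj_type, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match contents")
    return obj_type.decode("ascii"), body


# Stand-in for the bytes of a real loose object file.
raw = zlib.compress(b"blob 6\0hello\n", 1)
print(parse_loose_object(raw))  # ('blob', b'hello\n')
```

This only works on loose objects; objects that have been moved into a pack file are stored differently.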

See "Understanding Git Internals" by Jeff Kunkle, slide 8, for an illustration of a blob's actual content.

Bonfire answered 10/6, 2017 at 17:47 Comment(2)
So what's inside the SHA-1-named file is the compressed form (using zlib) of "header+content", while the SHA-1 is computed over the uncompressed "header+content". Right? - Carillo
@Carillo As torek says: the compression+delta is only for storing the object. SHA1 is for referencing the (uncompressed) content (+ header). Soon this will be using SHA-256: https://mcmap.net/q/12344/-why-doesn-39-t-git-use-more-modern-sha - Bonfire
