Why does Git use a cryptographic hash function?
Why does Git use SHA-1, a cryptographic hash function, instead of a faster non-cryptographic hash function?

Related question:

Stack Overflow question Why does Git use SHA-1 as version numbers? asks why Git uses SHA-1 as opposed to sequential numbers for commits.

Deception answered 1/3, 2015 at 11:1 Comment(4)
Personally I think that even using broken SHA-1 over SHA-2 was premature optimization. – Tamarau
@CodesInChaos: and besides, baking any particular algorithm into the code was a horrible violation of DI principles. Should be in an XML config file somewhere ;-) – Holdover
Update Dec. 2017 with Git 2.16 (Q1 2018): an effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?". – Sholapur
There are no good 160-bit or higher non-crypto hashes. Most are highly optimized 32-bit, 64-bit or 128-bit functions. 128-bit is alright, but I get the feeling that 128-bit is a bit low for a big project like git. If a fast, high-quality 224/256-bit hash came out, it would probably be ideal. – Mahogany

TL;DR:


You can hear it from Linus Torvalds himself, from when he presented Git at Google back in 2007
(emphasis mine):

We have checksums that are considered cryptographically secure. Nobody has been able to break SHA-1, but the point is, SHA-1, as far as git is concerned, isn't even a security feature. It's purely a consistency check.
The security parts are elsewhere. A lot of people assume since git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, they think that it's a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.

Having a good hash is good for being able to trust your data, it happens to have some other good features, too, it means when we hash objects, we know the hash is well distributed and we do not have to worry about certain distribution issues.

Internally it means from the implementation standpoint, we can trust that the hash is so good that we can use hashing algorithms and know there are no bad cases.

So there are some reasons to like the cryptographic side too, but it's really about the ability to trust your data.
I guarantee you, if you put your data in git, you can trust the fact that five years later, after it was converted from your hard disk to DVD to whatever new technology and you copied it along, five years later you can verify that the data you get back out is the exact same data you put in. And that is something you really should look for in a source code management system.
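Linus's "consistency check" is concrete: Git does not hash the raw file contents, it hashes a small type-and-length header followed by the content, so the object ID pins down everything about the blob. A minimal Python sketch of how a blob ID is computed (it matches what `git hash-object` prints):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes "blob <size>\0" + content, not the raw bytes,
    # so the object type and length are baked into the ID.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Same ID that `echo "hello" | git hash-object --stdin` prints:
print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because commits hash the tree, and trees hash the blobs, verifying one commit ID transitively verifies the whole snapshot — which is exactly the "trust your data five years later" property described above.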


Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".


I mentioned in "How would git handle a SHA-1 collision on a blob?" that you could engineer a commit with a particular SHA1 prefix (still an extremely costly endeavor).
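Engineering a chosen short prefix is plain brute force over some mutable field of the commit, and the expected work grows by a factor of 16 per hex digit — cheap for a vanity prefix, hopeless for a full 40-digit target. A toy sketch (the `salt:` trailer is purely illustrative; real vanity-hash tools typically perturb the committer timestamp or message instead):

```python
import hashlib

def find_commit_with_prefix(base: bytes, prefix: str) -> tuple[int, str]:
    """Brute-force a commit-shaped object whose SHA-1 starts with `prefix` (hex).

    Expected tries: about 16 ** len(prefix), so each extra digit
    multiplies the cost by 16.
    """
    for salt in range(1 << 32):
        # Illustrative mutable field; any byte that changes the object works.
        body = base + b"salt: %d\n" % salt
        obj = b"commit %d\x00%s" % (len(body), body)
        digest = hashlib.sha1(obj).hexdigest()
        if digest.startswith(prefix):
            return salt, digest
    raise RuntimeError("prefix not found in 2**32 tries")

salt, digest = find_commit_with_prefix(b"tree deadbeef\n", "dead")
print(digest)  # a hash starting with "dead", found in ~65k tries on average
```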
But the point remains, as Eric Sink mentions in "Git: Cryptographic Hashes" (from his 2011 book Version Control by Example):

It is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.

It is harder to find a good non-cryptographic hash with a low collision rate, unless you consider research like "Finding State-of-the-Art Non-cryptographic Hashes with Genetic Programming".
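That "extremely unlikely" can be quantified with the birthday approximation. A sketch (the billion-object repository is an illustrative figure, not a measurement):

```python
import math

def collision_probability(n_objects: float, bits: int) -> float:
    # Birthday approximation: p ≈ 1 - exp(-n^2 / 2^(bits+1)).
    # expm1 preserves precision when the probability is astronomically small.
    return -math.expm1(-(n_objects ** 2) / 2.0 ** (bits + 1))

# A (hypothetical) billion-object repository against 160-bit SHA-1:
print(collision_probability(1e9, 160))  # ~3.4e-31: negligible
# The same billion objects against a 64-bit non-crypto hash:
print(collision_probability(1e9, 64))   # ~0.027: a very real risk
```

This is the arithmetic behind the comment above that 128 bits feels "a bit low" for a project the size of git while 160 bits is comfortable: halving the bit width squares the odds.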

You can also read "Consider use of non-cryptographic hash algorithm for hashing speed-up", which mentions, for instance, "xxHash", an extremely fast non-cryptographic hash algorithm working at speeds close to RAM limits.


Discussions around changing the hash in Git are not new:

(Linus Torvalds)

There's not really anything remaining of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;)

And you need to be careful about how you measure the actual optimization gain:

(Linus Torvalds)

I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues.

(John Tapsell - johnflux)

The engineering cost for upgrading git from SHA-1 to a new algorithm is much higher. I'm not sure how it can be done well.

First of all we probably need to deploy a version of git (let's call it version 2 for this conversation) which allows there to be a slot for a new hash value even though it doesn't read or use that space -- it just uses the SHA-1 hash value which is in the other slot.

That way once we eventually deploy yet a newer version of git, let's call it version 3, which produces SHA-3 hashes in addition to SHA-1 hashes, people using git version 2 will be able to continue to inter-operate.
(Although, per this discussion, they may be vulnerable and people who rely on their SHA-1-only patches may be vulnerable.)

In short, switching to any hash is not easy.
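The dual-slot idea described above can be illustrated with a toy record. This is purely hypothetical — Git's real transition design (documented later in its hash-function-transition plan) uses a different on-disk format — but it shows the interoperability trick: old clients only ever look at the first slot.

```python
from dataclasses import dataclass
from typing import Optional
import hashlib

@dataclass
class ObjectId:
    """Toy dual-slot ID: 'version 2' clients read only `sha1`;
    'version 3' clients also fill and verify `sha256`."""
    sha1: str
    sha256: Optional[str] = None

def identify(obj: bytes) -> ObjectId:
    return ObjectId(
        sha1=hashlib.sha1(obj).hexdigest(),
        # Older clients ignore this slot entirely, so both can interoperate;
        # anyone relying on SHA-1 alone keeps the SHA-1-only weaknesses.
        sha256=hashlib.sha256(obj).hexdigest(),
    )

print(identify(b"blob 6\x00hello\n"))
```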


Update February 2017: yes, it is in theory possible to compute a colliding SHA1: shattered.io

How is GIT affected?

GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits.
It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one.
An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision.

But:

This attack required over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power as 6,500 years of single-CPU computations and 110 years of single-GPU computations.
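As a sanity check on those figures: the quoted count is exactly 2^63 SHA-1 computations, and a generic birthday attack on a 160-bit hash needs about 2^80, so shattered was roughly 2^17 (about 131,000) times cheaper than brute force — a real break, but still enormously expensive:

```python
# The quoted figure is exactly 2**63 SHA-1 computations:
attack_cost = 9_223_372_036_854_775_808
assert attack_cost == 2 ** 63

# A generic birthday attack on a 160-bit hash costs about 2**80 operations:
print(2 ** 80 // attack_cost)  # 131072, i.e. 2**17 times cheaper than brute force
```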

So let's not panic just yet.
See more at "How would Git handle a SHA-1 collision on a blob?".

Sholapur answered 1/3, 2015 at 11:4 Comment(10)
It does seem like the recent crop of high-quality non-cryptographic hash functions, like xxHash, came out just a little too late -- right after git. – Deception
@Deception indeed. There have been discussions about replacing SHA-1 with another hash, but it would simply require quite a bit of work, for something which, for now, is working fine. – Sholapur
"we know the hash is well distributed and we do not have to worry about certain distribution issues" – why is this an issue for an SCM? – Amar
@Amar the collision rate is low enough to be well-suited for an SCM, where the data is generally not random but text files. – Sholapur
Calculating the probability of a collision in a perfectly uniform hash. – Interloper
@Amar usability comes to mind: in most cases Git can identify any commit with an eight-character prefix, tops. That's not easy to achieve without a cryptographic hash function. Of course, there are other concerns as well. – Outfield
Actually, there is a security reason for using a cryptographic hash. When an author (say Linus) wants to cut a release (of, say, Linux), people want to know the source code they download matches what the author intended to include in the release. To this end, the last commit hash in the release is tagged and the tag is signed. If the commit hash chain ending in the tag were not cryptographically secure, then the source code could be smudged into something other than what the author intended. – Palliasse
@ChristopherKing Interesting. Would that "smudge" be horribly difficult to do though? – Sholapur
Yep. The point of a cryptographic hash is to make it cryptographically difficult to discover a colliding hash. – Palliasse
@ChristopherKing Indeed. I remember illustrating that point in https://mcmap.net/q/12622/-how-would-git-handle-a-sha-1-collision-on-a-blob – Sholapur
