Hash collision in git

What would actually happen if I had a hash collision while using git?

E.g. I manage to commit two files with the same sha1 checksum, would git notice it or corrupt one of the files?

Could git be improved to live with that, or would I have to change to a new hash algorithm?

(Please do not deflect this question by discussing how unlikely that is - Thanks)

Eger answered 3/5, 2012 at 15:16 Comment(15)
I've been informed by the git Gods that the chances of a SHA1 collision is the same as the Earth being sucked up into the black hole created by the CERN accelerator. If this is indeed true, then there's no need for that extra memcmp. , source: lwn.net/Articles/307281Counterchange
@KurzedMetal: But SHA-1 is considered somewhat weak, so it could easily be vulnerable to a collision attack in the future.Tertullian
ABSOLUTELY NOT SO. To quote Dan Bernstein: "The fact that academics haven't carried out the SHA-1 collision attack yet is a minor historical accident" - now that the SHA-3 contest is over, there's a good chance the relevant people will turn their attention to using the known attack to produce a collision. Marc Stevens estimates the difficulty as a mere 2^61 operations. There will very likely be a SHA-1 collision exhibited soon; it's odd that it hasn't happened already.Hope
@KurzedMetal: There is a chance of creating a black hole at CERN (two protons would have to collide with 10^-15 m accuracy), however this black hole would not suck Earth up, it would instantly evaporate due to Hawking radiation... So the chances of a SHA1 collision are much bigger than being sucked up... just saying...Mythical
possible duplicate of How would git handle a SHA-1 collision on a blob?Sultry
All this talk about SHA1 being "weak" or "vulnerable" is irrelevant. Git doesn't use SHA1 hashing for security, it uses it for speed. The ability to produce a contrived hash collision doesn't matter.Sultry
@meager: The quality of the hash function does matter, but indeed not for "contrived" collisions (Git is immune against that). SHA1 was not just chosen for speed, but also for data integrity. Accidental collisions are a valid concern, even if they are discussed from confusing security perspectives, e.g. by Linus himselfHosanna
You'll probably find this answer useful ;) (I agree with the duplicate)Roussel
"I've been informed by the git Gods that the chances of a SHA1 collision is the same as the Earth being sucked up into the black hole created by the CERN accelerator." HOLY CRAP THE EARTH IS BEING SUCKED UP BY A BLACK HOLE!!!! :( :( shattered.it This was done in 2^63 operations, not 2^61 as seen above.Alla
It's astonishing that you specifically asked people not to discuss the unlikeliness of git collision, and almost everyone talked about the unlikeliness of git collision. These people should be banned from stackoverflow for life!Marquettamarquette
Replace the sha1 binary with one that produces a static string and see the effect.Discountenance
Update Dec. 2017 with Git 2.16 (Q1 2018): an effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".Fulltime
Update August 2018 with Git 2.19 (Q3 2018): a new hash algorithm has been chosen: SHA-256. See the revised answer "Why doesn't Git use more modern SHA?".Fulltime
Does this answer your question? How would Git handle a SHA-1 collision on a blob?Francinafrancine
I lost count of how many times I've been told "That won't happen!" just to be blamed for it happening.Shawn

Picking atoms on 10 Moons

An SHA-1 hash is a 40 hex character string... that's 4 bits per character times 40... 160 bits. Now we know 10 bits is approximately 1000 (1024 to be exact), meaning that there are 1 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 different SHA-1 hashes... 10^48.

What is this equivalent of? Well, the Moon is made up of about 10^47 atoms. So if we have 10 Moons... and you randomly pick one atom on one of these moons... and then go ahead and pick a random atom on them again... then the likelihood that you'll pick the same atom twice is the likelihood that two given git commits will have the same SHA-1 hash.
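The atom comparison can be sanity-checked in a couple of lines (the 10^47 atoms-per-Moon figure is the estimate used above, not a measured value):

```python
# Compare the SHA-1 output space to the atoms on ten Moons.
sha1_space = 2 ** 160            # number of distinct SHA-1 digests
atoms_per_moon = 10 ** 47        # rough estimate used above
atoms_on_ten_moons = 10 * atoms_per_moon

print(f"SHA-1 digests:      {sha1_space:.3e}")
print(f"Atoms on ten Moons: {atoms_on_ten_moons:.3e}")
# The two numbers agree to within a factor of ~1.5.
```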

Expanding on this we can ask the question...

How many commits do you need in a repository before you should start worrying about collisions?

This relates to so called "Birthday attacks", which in turn refers to the "Birthday Paradox" or "Birthday Problem", which states that when you pick randomly from a given set, you need surprisingly few picks before you are more likely than not to have picked something twice. But "surprisingly few" is a very relative term here.

Wikipedia has a table on the probability of Birthday Paradox collisions. There is no entry for a 40 character hash. But an interpolation of the entries for 32 and 48 characters lands us in the range of 5×10^22 git commits for a 0.1% probability of a collision. That is fifty thousand billion billion different commits, or fifty Zettacommits, before you have reached even a 0.1% chance that you have a collision.
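The interpolation can be checked against the standard birthday-problem approximation n ≈ √(2·H·ln(1/(1−p))), a sketch rather than an exact computation:

```python
import math

# Birthday-problem approximation: picks n from a space of size H
# needed to reach collision probability p:  n ≈ sqrt(2*H*ln(1/(1-p)))
H = 2 ** 160          # size of the SHA-1 output space
p = 0.001             # 0.1% collision probability

n = math.sqrt(2 * H * math.log(1 / (1 - p)))
print(f"{n:.2e}")     # lands around 5.4e+22, i.e. ~5*10^22 commits
```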

The byte sum of the hashes alone for these commits would be more data than all the data generated on Earth for a year, which is to say you would need to churn out code faster than YouTube streams out video. Good luck with that. :D

The point of this is that unless someone is deliberately causing a collision, the probability of one happening at random is so staggeringly small that you can ignore this issue.

"But when a collision does occur, then what actually happens?"

Ok, suppose the improbable does happen, or suppose someone managed to tailor a deliberate SHA-1 hash collision. What happens then?

In that case there is an excellent answer where someone experimented on it. I will quote from that answer:

  1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained above).
  2. If a tree object already exists and you make a blob with the same hash: Everything will seem normal, until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
  3. If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt
  4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  5. If a blob already exists and you make a tree object with the same hash. It will fail when creating the commit.
  6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
  8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
  9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

As you can see some cases are not good. Especially cases #2 and #3 mess up your repository. However, it does seem that the fault stays within that repository, and the attack or bizarre improbability does not propagate to other repositories.
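Note that in every one of these cases the collision is over the whole object, header included: a blob's ID is the SHA-1 of "blob <size>\0" followed by the content. A minimal sketch reproducing Git's blob hashing with nothing but the standard library:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the object ID Git assigns to a blob:
    SHA-1 over the header 'blob <length>\\0' followed by the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known ID of the empty blob, same as `git hash-object /dev/null`:
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```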

Also, it seems that the issue of deliberate collisions is being recognised as a real threat, and so for instance GitHub is taking measures to prevent it.

Intersex answered 23/4, 2014 at 19:9 Comment(8)
I don't know if the numbers are accurate, but gosh this is a great graphical way to describe the unlikelihood, and funny :)Megasporophyll
@UtkarshKumar Well the fun part will be to verify that the number of atoms is correct. Good luck in counting them. :DIntersex
The chance that a random commit of an actual text file collides is as good as zero, very unlikely. But this answer completely skips over the fact that somebody could try and deliberately create a collision. With the SHA-1 hash under attack, that is becoming a rather important factor.Reconnoitre
Reason for down voting: Very nicely said, but probability means absolutely nothing here. You can say the same about winning the lotto, but people win lotto here and there on a daily basis. So the lotto company can't really just say: the chance is small so we shouldn't have to worry about actually paying out the jackpot. The OP's question here is: what happens when that small chance occurs, and you failed to answer that.Marquettamarquette
@FukuzawaYukio There are not 2^48 lottery tickets printed, however -- only millions (maybe 200 million total per year.. who knows?), and there is a winning lottery. The probability is much higher, and for some lottery tickets, the winning ticket is always printed; so, the winner is inevitable (unless the winning ticket is accidentally misplaced). Also, I made a pseudo-realistic lottery ticket game many years ago: lottery.py. Needless to say, you lose 99% of the time.Why
What about the shortened hashes? I can do git log 9274 and it will show me the log starting from the commit with hash that begins with 9274. What if there were two hashes that begin with 9274?Planish
@Planish Well, that should not be too hard to try out. Make a program that commits random noise automatically and keeps a record of shortened hashes. When you have a duplicate, which should only require a few thousand commits, try and see what happens.Intersex
-1 because this discusses a tangentially-related topic the author specifically mentions is not relevant. The part that actually answers the question is a copy-paste of another user's answer (and thereby work).Maladroit

If two files have the same hash sum in git, it would treat those files as identical. In the absolutely unlikely case this happens, you could always go back one commit, and change something in the file so they wouldn't collide anymore ...

See Linus Torvalds' post in the thread “Starting to think about sha-256?” in the git mailing list.
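The "change something in the file" remedy works because even a trivial edit yields an unrelated hash; a quick sketch (the file contents are invented for illustration):

```python
import hashlib

original = b"print('hello')\n"
tweaked = b"print('hello')\n\n"  # same file with one extra newline

d1 = hashlib.sha1(original).hexdigest()
d2 = hashlib.sha1(tweaked).hexdigest()

# Avalanche effect: a one-byte change produces a completely different
# digest, so any small edit would break the hypothetical collision.
print(d1)
print(d2)
```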

Donnelly answered 3/5, 2012 at 15:31 Comment(9)
"If two files have the same hash sum in git, it would treat those files as identical." This is actually a proper answer. However, do you have some source for this statement klaustopher? Your link is not working for me.Entitle
But this is not so absolutely unlikely if you work on a project with a collection of samples of hash collision.Nuthouse
@Doomjunky, Assign a number (say 1 through n) to each of your hash collision samples. Put a line at the top of each file with its unique number. They won't collide anymore, but whatever tools you're using to analyze them can just ignore the first line.Epigrammatist
A hash collision just happened to me (git merged two completely unrelated files)Streit
@Streit No it didn't. If you do have a proof of a hash collision you will have instant fame. Don't forget to post it! I'll send a crate of truly good Haarlem beer if you show me a full size SHA-1 hash collision created within Git within a week. Note that it must be a separate hash collision, not one already quoted elsewhere (not that anybody has posted one yet, but still).Reconnoitre
@Maarten Bodewes Well, it behaved like a hash collision. My guess is that it was a bug that made git see two different files as being empty files, which would give the same hash to both (the filename is not part of the hash calculation)Streit
+1 The only answer so far that actually answers the question. All the rest are just babbling about the "small chance" it might occurs, which every developer already knows.Marquettamarquette
@Streit "git merged two completely unrelated files" That is not how git behaves with a hash collisionTanana
Be very wary about Linus discussing IT security — He has been wrong before and he's wrong on this one. If one could create SHA-1 collisions at will, one could use it for all sorts of mayhem such as creating circular histories that cause Git servers and clients to crash.Fleurdelis

It's not really possible to answer this question with the right "but" without also explaining why it's not a problem. It's not possible to do that without really having a good grip on what a hash really is. It's more complicated than the simple cases you might have been exposed to in a CS program.

There is a basic misunderstanding of information theory here. If you reduce a large amount of information into a smaller amount by discarding some of it (i.e. a hash), there will be a chance of collision directly related to the length of the data; the shorter the data, the LESS likely it will be. Now, the vast majority of the possible collisions will be gibberish, making them that much less likely to actually happen (you would never check in gibberish... even a binary image is somewhat structured). In the end, the chances are remote.

To answer your question: yes, git will treat them as the same, and changing the hash algorithm won't help. It would take a "second check" of some sort, but ultimately you would need as much "additional check" data as the length of the data to be 100% sure... keep in mind you would be 99.99999... (to a really long number of digits) ...sure with a simple check like you describe.

SHA-x are cryptographically strong hashes, which means it's generally hard to intentionally create two source data sets that are both VERY SIMILAR to each other and have the same hash. One bit of change in the data should create more than one (preferably as many as possible) bits of change in the hash output, which also means it's very difficult (but not quite impossible) to work back from the hash to the complete set of collisions, and thereby pull out the original message from that set of collisions: all but a few will be gibberish, and of the ones that aren't, there's still a huge number to sift through if the message length is any significant length. The downside of a crypto hash is that it is slow to compute... in general.

So, what does it all mean for Git? Not much. The hashes get computed so rarely (relative to everything else) that their computational penalty is low overall. The chance of hitting a pair of collisions is so low that it's not realistic for one to occur and not be detected immediately (i.e. your code would most likely suddenly stop building), allowing the user to fix the problem (back up a revision and make the change again; you'll almost certainly get a different hash because of the time change, which also feeds the hash in git). It is more likely to be a real problem for you if you're storing arbitrary binaries in git, which isn't really what its primary use model is. If you want to do that... you're probably better off using a traditional database.

It's not wrong to think about this - it's a good question that a lot of people just pass off as "so unlikely it's not worth thinking about" - but it's really a little more complicated than that. If it DOES happen, it should be very readily detectable; it won't be a silent corruption in a normal workflow.

Burhans answered 30/6, 2013 at 0:3 Comment(5)
you'll almost certainly get a different hash because of the time change, which also feeds the hash in git Isn't the hash based solely on the contents of a file?Hick
The hash of a blob is based on the contents of a file (with a tiny bit of metadata), however the hash of a commit (which in theory could also collide) contains the current time, as well as the hash of the tree, the author, the hashes of parent commits etc. However, as @Burhans points out, small things are less likely to collide, and a commit is a small thing.Derian
Don't think I agree with the "The shorter the data, the LESS likely [collisions] will be". If you mean shorter hashes, then you're reducing the set of possible hashes = more inputs map to each hash = higher collision chance. If you mean shorter messages you're hashing, then this is only true in the sense that the number of possible inputs is limited by the number of characters used, which seems so obvious I feel I must be missing your point?Overexert
I never thought of the "VERY SIMILAR" point, which is a really good point. It basically means that in order to have 2 commits with the same hash, you would need to change a significant portion of the characters in every single file (not to mention the file names, paths and the number of files).Immersed
@Immersed No, in order to get a specific hash, from an arbitrary initial file, you would typically have to change the information in the file by an amount similar to the number of bits of information in the hash, i.e., around 160 bits for SHA-1. However, information about which bits to change also counts here, so the longer the file, the fewer bits you have to change if you choose the correct ones. Hypothetically, given a file of length well above 2^160 bytes, you could get almost any hash by changing a single bit, since the location of that bit carries more than 160 bits of information!Malleable

You can see a good study in "How would Git handle a SHA-1 collision on a blob?".

Since a SHA1 collision is now possible (as I reference in this answer with shattered.io), know that Git 2.13 (Q2 2017) will improve/mitigate the current situation with a "detect attempt to create collisions" variant of SHA-1 implementation by Marc Stevens (CWI) and Dan Shumow (Microsoft).

See commit f5f5e7f, commit 8325e43, commit c0c2006, commit 45a574e, commit 28dc98e (16 Mar 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 48b3693, 24 Mar 2017)

Makefile: make DC_SHA1 the default

We used to use the SHA1 implementation from the OpenSSL library by default.
As we are trying to be careful against collision attacks after the recent "shattered" announcement, switch the default to encourage people to use DC_SHA1 implementation instead.
Those who want to use the implementation from OpenSSL can explicitly ask for it by OPENSSL_SHA1=YesPlease when running "make".

We don't actually have a Git-object collision, so the best we can do is to run one of the shattered PDFs through test-sha1. This should trigger the collision check and die.


Could Git be improved to live with that, or would I have to change to a new hash algorithm?

Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".

You will be able to use another hash algorithm: SHA1 is no longer the only one for Git.


Git 2.18 (Q2 2018) documents that process.

See commit 5988eb6, commit 45fa195 (26 Mar 2018) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit d877975, 11 Apr 2018)

doc hash-function-transition: clarify what SHAttered means

Attempt to clarify what the SHAttered attack means in practice for Git.
The previous version of the text made no mention whatsoever of Git already having a mitigation for this specific attack, which the SHAttered researchers claim will detect cryptanalytic collision attacks.

I may have gotten some of the nuances wrong, but as far as I know this new text accurately summarizes the current situation with SHA-1 in git. I.e. git doesn't really use SHA-1 anymore, it uses Hardened-SHA-1 (they just so happen to produce the same outputs 99.99999999999...% of the time).

Thus the previous text was incorrect in asserting that:

[...]As a result [of SHAttered], SHA-1 cannot be considered cryptographically secure any more[...]

That's not the case. We have a mitigation against SHAttered, however we consider it prudent to move to work towards a NewHash should future vulnerabilities in either SHA-1 or Hardened-SHA-1 emerge.

So the new documentation now reads:

Git v2.13.0 and later subsequently moved to a hardened SHA-1 implementation by default, which isn't vulnerable to the SHAttered attack.

Thus Git has in effect already migrated to a new hash that isn't SHA-1 and doesn't share its vulnerabilities, its new hash function just happens to produce exactly the same output for all known inputs, except two PDFs published by the SHAttered researchers, and the new implementation (written by those researchers) claims to detect future cryptanalytic collision attacks.

Regardless, it's considered prudent to move past any variant of SHA-1 to a new hash. There's no guarantee that future attacks on SHA-1 won't be published in the future, and those attacks may not have viable mitigations.

If SHA-1 and its variants were to be truly broken, Git's hash function could not be considered cryptographically secure any more. This would impact the communication of hash values because we could not trust that a given hash value represented the known good version of content that the speaker intended.

Note: that same document now (Q3 2018, Git 2.19) explicitly references the "new hash" as SHA-256: see "Why doesn't Git use more modern SHA?".

Fulltime answered 11/4, 2017 at 20:46 Comment(2)
This is the only decent answer or comment here. Summary is - though extremely unlikely, it's possible. A collision would also be immediately identifiable, and remedied through tweaking a file (with a comment) to avoid the collision. Intentional exploits are thought to be irrelevant, because someone could just as easily check in "bad code" - and there are things like signatures and deliberate pull requests to procedurally prevent random people from checking in random things.Particia
Thanks for that amazing rundown.Phillip

Could git be improved to live with that, or would I have to change to a new hash algorithm?

Collisions are possible for any hash algorithm, so changing the hash function doesn't preclude the problem, it just makes it less likely to happen. So you should then choose a really good hash function (SHA-1 already is one, but you asked not to be told that :)

Caramel answered 3/5, 2012 at 15:26 Comment(2)
I think you mean "more unlikely" or "less likely", right? Sure, you could change to a hash algorithm with fewer bytes in the output, but that is not what you meant, right? :)Intersex
SHA-1 is broken in the sense that it will become possible to create deliberate hash collisions. I think it already was in 2012 as well. So changing to a different hash which is more secure and has a larger state & output certainly would make a difference.Reconnoitre

Google now claims that SHA-1 collision is possible under certain preconditions: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Since git uses SHA-1 to check for file integrity, this means that file integrity in git is compromised.

IMO, git should definitely use a better hashing algorithm since deliberate collision is now possible.

Safeguard answered 24/2, 2017 at 3:22 Comment(1)
Also, it would be prudent not to trust Linus' word regarding computer security. He has been wrong before, and he's wrong on this one. (For instance, a SHA-1 collision oracle lets one create circular commit histories to crash servers and clients alike)Fleurdelis

A hash collision is so unlikely that it is sheer mind-blowing! Scientists all over the world are trying hard to achieve one, but haven't managed it yet. For certain algorithms such as MD5 they have succeeded, though.

What are the odds?

SHA-256 has 2^256 possible hashes. That is about 10^77. Or to be more graphic, the chances of a collision are at about

1 : 100 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000

The chance of winning the lottery is about 1 in 14 million. The chance of a collision with SHA-256 is like winning the lottery on 11 consecutive days!

Mathematical explanation: 14 000 000^11 ≈ 2^256

Furthermore, the universe has about 10^80 atoms. That's only about a thousand times more than there are SHA-256 combinations.
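The lottery arithmetic can be sanity-checked with logarithms (1-in-14-million is the lottery odds assumed above):

```python
import math

lottery_odds = 14_000_000      # 1-in-14-million, as assumed above
wins_in_a_row = 11

# log2(14e6^11) should land near 256 bits if the comparison holds.
bits = wins_in_a_row * math.log2(lottery_odds)
print(f"{bits:.1f} bits")      # ~261 bits, in the ballpark of 2^256
```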

Successful MD5 collision

Even for MD5 the chances are tiny. Though, mathematicians managed to create a collision:

d131dd02c5e6eec4 693d9a0698aff95c 2fcab58712467eab 4004583eb8fb7f89
55ad340609f4b302 83e488832571415a 085125e8f7cdc99f d91dbdf280373c5b
d8823e3156348f5b ae6dacd436c919c6 dd53e2b487da03fd 02396306d248cda0
e99f33420f577ee8 ce54b67080a80d1e c69821bcb6a88393 96f9652b6ff72a70

has the same MD5 as

d131dd02c5e6eec4 693d9a0698aff95c 2fcab50712467eab 4004583eb8fb7f89
55ad340609f4b302 83e4888325f1415a 085125e8f7cdc99f d91dbd7280373c5b
d8823e3156348f5b ae6dacd436c919c6 dd53e23487da03fd 02396306d248cda0
e99f33420f577ee8 ce54b67080280d1e c69821bcb6a88393 96f965ab6ff72a70
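These two blocks are the well-known Wang et al. collision pair, and the claim can be verified in a few lines (the hex is copied from the blocks above):

```python
import hashlib

m1 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
)
m2 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
)

assert m1 != m2                                              # different inputs...
assert hashlib.md5(m1).digest() == hashlib.md5(m2).digest()  # ...same MD5
print(hashlib.md5(m1).hexdigest())
```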

This doesn't mean that MD5 is less safe now that its algorithm is cracked. You can create MD5 collisions on purpose, but the chance of an accidental MD5 collision is still about 1 in 2^128, which is still a lot.

Conclusion

You don't have to have a single worry about collisions. Hashing algorithms are the second safest way to check file sameness. The only safer way is a binary comparison.

Tu answered 19/3, 2015 at 13:50 Comment(5)
This answer talks mostly about SHA-256, which is irrelevant since the question was about SHA-1. The math showing the unlikeliness of a SHA-256 collision is much more optimistic than a SHA-1 would result in. It's still very unlikely, but a SHA-1 answer would have been more relevant.Dennadennard
@AndrewArnott There is no relevant difference between SHA-256 and SHA-1. SHA-1 is 2^128 times weaker, but this also doesn't matter. It's still not breakable, so my answer isn't so misplaced.Tu
SHA-1 is indeed broken so saying it's "still not breakable" is also incorrect. Given SHA-1 is in fact broken, someone could conceivably intentionally attack git's sha-1 algorithm to replace content without being detected. SHA-256 has not yet been broken, so it would be more secure. Thus, answering a question about potential git collisions would be best kept to SHA-1.Dennadennard
"This doesn't mean that MD5 is less safe now that its algorithm is cracked." Come again? Could you explain that sentence?Reconnoitre
Reason for the answer: Because there is a lot of confusion among people who are not familiar with computing and still land here from searching the web. Misconceptions about "encryption vs. computing power" are in my experience more common than you think so I addressed this as additional information.Tu

Well I guess we now know what would happen - you should expect that your repository would become corrupted (source).

Leaven answered 25/2, 2017 at 10:38 Comment(0)

I recently found a posting from 2013-04-29 in a BSD discussion group at

http://openbsd-archive.7691.n7.nabble.com/Why-does-OpenBSD-use-CVS-td226952.html

where the poster claims:

I ran into a hash collision once, using git rebase.

Unfortunately, he provides no proof for his claim. But maybe you would like trying to contact him and ask him about this supposed incident.

But on a more general level, due to the birthday paradox, a collision among SHA-1 hashes becomes likely once roughly pow(2, 80) hashes have been generated.

This sounds a lot and is certainly way more than the total number of versions of individual files present in all Git repositories of the world combined.

However, this only applies to the versions which actually remain in version history.

If a developer relies very much on rebasing, every time a rebase is run for a branch, all the commits in all the versions of that branch (or rebased part of the branch) get new hashes. The same is true for every file modified with "git filter-branch". Therefore, "rebase" and "filter-branch" might be big multipliers for the number of hashes generated over time, even though not all of them are actually kept: frequently, after rebasing (especially for the purpose of "cleaning up" a branch), the original branch is thrown away.

But if the collision occurs during the rebase or filter-branch, it can still have adverse effects.

Another thing would be to estimate the total number of hashed entities in git repositories and see how far they are from pow(2, 80).

Let's say we have about 8 billion people, and all of them are running git and keep their stuff versioned in 100 git repositories per person. Let's further assume the average repository has 100 commits and 10 files, and only one of those files changes per commit.

For every revision we have at least a hash for the tree object and the commit object itself. Together with the changed file we have 3 hashes per revision, and thus 300 hashes per repository.

For 100 repositories of 8 billion people this gives pow(2, 47) which is still far from pow(2, 80).
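The back-of-the-envelope figure can be recomputed in a couple of lines (all inputs are the assumptions stated above):

```python
import math

people = 8_000_000_000        # assumed world population, all using git
repos_per_person = 100
hashes_per_repo = 300         # 100 commits * 3 hashes each, as above

total_hashes = people * repos_per_person * hashes_per_repo
print(f"2^{math.log2(total_hashes):.1f}")  # just under 2^48, far below 2^80
```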

However, this does not include the supposed multiplication effect mentioned above, because I am uncertain how to include it in this estimation. Maybe it could increase the chances for a collision considerably. Especially if very large repositories with a long commit history (like the Linux Kernel) are rebased by many people for small changes, which nevertheless create different hashes for all affected commits.

Stemware answered 26/1, 2018 at 7:34 Comment(1)
Interesting. +1. As I mention above, this issue will go away eventually: https://mcmap.net/q/12344/-why-doesn-39-t-git-use-more-modern-shaFulltime
