How would Git handle a SHA-1 collision on a blob?

This has probably never happened in the real world yet, and may never happen, but let's consider it: say you have a Git repository, make a commit, and get very, very unlucky: one of the blobs ends up with the same SHA-1 as another that is already in your repository. The question is, how would Git handle this? Simply fail? Find a way to link the two blobs and check which one is needed according to the context?

More a brain-teaser than an actual problem, but I found the issue interesting.

Toreutic asked 22/2, 2012 at 9:42 Comment(7)
Once a brain teaser, now potentially an actual problem. – Coly
@Toby This question was about a pre-image attack; what Google demonstrated is a collision attack, which is similar but slightly different. You can read more about the difference here. – Franglais
@Saheed I don't see what part of this question is about a pre-image attack specifically, as the question posed is just about a collision in a git repository, not about exploiting it. – Coly
@Toby The original brain teaser was not about an attack (neither pre-image nor collision) but about an accidental collision, which is so unfathomably unlikely that it is not worth considering. I think what Saheed was correctly trying to say is that this is still not an actual problem. However, you are right that the Google collision attack has potentially created a security problem, depending on how Git is used. – Carrollcarronade
Here's a second collision that's only 320 bytes: privacylog.blogspot.com/2019/12/the-second-sha-collision.html – Evangelineevangelism
With the development of a chosen-prefix attack for SHA-1, this is moving into the realm of the possible. arstechnica.com/information-technology/2020/01/… – Hyperbola
Why doesn't Git simply check for the existence of a SHA-1 collision? All of these problems would be solved. – Ely

I did an experiment to find out exactly how Git would behave in this case. This was with version 2.7.0~rc0+next.20151210 (Debian version). I basically just reduced the hash size from 160 bits to 4 bits by applying the following diff and rebuilding git:

--- git-2.7.0~rc0+next.20151210.orig/block-sha1/sha1.c
+++ git-2.7.0~rc0+next.20151210/block-sha1/sha1.c
@@ -246,6 +246,8 @@ void blk_SHA1_Final(unsigned char hashou
    blk_SHA1_Update(ctx, padlen, 8);

    /* Output hash */
-   for (i = 0; i < 5; i++)
-       put_be32(hashout + i * 4, ctx->H[i]);
+   for (i = 0; i < 1; i++)
+       put_be32(hashout + i * 4, (ctx->H[i] & 0xf000000));
+   for (i = 1; i < 5; i++)
+       put_be32(hashout + i * 4, 0);
 }

Then I did a few commits and noticed the following.

  1. If a blob already exists with the same hash, you will not get any warnings at all. Everything seems to be ok, but when you push, someone clones, or you revert, you will lose the latest version (in line with what is explained in the other answers).
  2. If a tree object already exists and you make a blob with the same hash: everything will seem normal until you either try to push or someone clones your repository. Then you will see that the repo is corrupt.
  3. If a commit object already exists and you make a blob with the same hash: same as #2 - corrupt.
  4. If a blob already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  5. If a blob already exists and you make a tree object with the same hash, it will fail when creating the commit.
  6. If a tree object already exists and you make a commit object with the same hash, it will fail when updating the "ref".
  7. If a tree object already exists and you make a tree object with the same hash, everything will seem ok. But when you commit, all of the repository will reference the wrong tree.
  8. If a commit object already exists and you make a commit object with the same hash, everything will seem ok. But when you commit, the commit will never be created, and the HEAD pointer will be moved to an old commit.
  9. If a commit object already exists and you make a tree object with the same hash, it will fail when creating the commit.

For #2 you will typically get an error like this when you run "git push":

error: object 0400000000000000000000000000000000000000 is a tree, not a blob
fatal: bad blob object
error: failed to push some refs to origin

or:

error: unable to read sha1 file of file.txt (0400000000000000000000000000000000000000)

if you delete the file and then run "git checkout file.txt".

For #4 and #6, you will typically get an error like this:

error: Trying to write non-commit object
f000000000000000000000000000000000000000 to branch refs/heads/master
fatal: cannot update HEAD ref

when running "git commit". In this case you can typically just type "git commit" again since this will create a new hash (because of the changed timestamp)

For #5 and #9, you will typically get an error like this:

fatal: 1000000000000000000000000000000000000000 is not a valid 'tree' object

when running "git commit".

If someone tries to clone your corrupt repository (here, one with a collided blob, where d000000000000000000000000000000000000000 is a commit and f000000000000000000000000000000000000000 is a tree), they will typically see something like:

Cloning into 'clonedversion'...
done.
error: unable to read sha1 file of s (d000000000000000000000000000000000000000)
error: unable to read sha1 file of tullebukk
(f000000000000000000000000000000000000000)
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry the checkout with 'git checkout -f HEAD'

What "worries" me is that in two cases (2,3) the repository becomes corrupt without any warnings, and in 3 cases (1,7,8), everything seems ok, but the repository content is different than what you expect it to be. People cloning or pulling will have a different content than what you have. The cases 4,5,6 and 9 are ok, since it will stop with an error. I suppose it would be better if it failed with an error at least in all cases.

Wynellwynn answered 4/1, 2016 at 20:15 Comment(9)
Awesome answer; reducing the hash size to see how it actually behaves is a great idea. – Toreutic
How probable is it that this actually happens without a reduced hash size? – Leasehold
To add to Mathias's question: I'm guessing this probability depends on the number of commits. Is that true? – Sinuosity
Also, what are the plans, if any, to move to another hashing algorithm? – Sinuosity
@MathiasBader Assuming you added blobs made of uniformly distributed random bytes, you would be hit by the birthday paradox, and it would be around 2^-80; however, in a Git repo that is never the case (it is mostly code, which is not a uniformly distributed vector of bytes), so the probability is even lower. @Sinuosity It does, but it is still quite small. And AFAIK there are no plans to move to another hash function (I think it would be quite hard to do), as there is no need. – Anadiplosis
@lapo So far that can only fall into the three cases where the repository content is different from what you expect (where the objects are of the same type): two different valid commits would need to differ only in the commit message (and would have either very weird commit messages or functionally identical ones), and two different valid tree objects would (I think; I could be wrong) need to differ only in the filenames (which would be weird binary filenames). Two blobs with the same hash could still be a problem. Other attacks might still be invented later, of course. – Baghdad
I agree that this is a good answer from a technical standpoint. But even without having read Torvalds' statement, every developer worthy of that job title should know a bit about hash functions in general and about the purpose they serve in Git specifically. Then it is easy to deduce that the question posed here is clearly a first-world problem. I just shrugged when I read the question, thinking: "Yeah, maybe a nifty exercise to think about, but not a practical problem." The OP says the same; I just want to confirm. Just my two cents. – Bailee
@phil_lgr Were Linus's remarks backed up anywhere when G+ was shut down? – Hyperbola
A working version of the Google+ link can be found here via the Wayback Machine. – Seacock

Original answer (2012) (see shattered.io 2017 SHA1 collision below)

That old (2006) answer from Linus might still be relevant:

Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will not overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order.

However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your own repository.
So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one.

So you have two cases of collision:

  • the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1.
    At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the old object.
    You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something completely different than what you expected.
    So you would generally notice this kind of collision fairly quickly.
    In related news, the question is what to do about the inadvertent collision..
    First off, let me remind people that the inadvertent kind of collision is really really really damn unlikely, so we'll quite likely never ever see it in the full history of the universe.
    But if it happens, it's not the end of the world: what you'd most likely have to do is just change the file that collided slightly, and just force a new commit with the changed contents (add a comment saying "/* This line added to avoid collision */") and then teach git about the magic SHA1 that has been shown to be dangerous.
    So over a couple of million years, maybe we'll have to add one or two "poisoned" SHA1 values to git. It's very unlikely to be a maintenance problem ;)

  • The attacker kind of collision because somebody broke (or brute-forced) SHA1.
    This one is clearly a lot more likely than the inadvertent kind, but by definition it's always a "remote" repository. If the attacker had access to the local repository, he'd have much easier ways to screw you up.
    So in this case, the collision is entirely a non-issue: you'll get a "bad" repository that is different from what the attacker intended, but since you'll never actually use his colliding object, it's literally no different from the attacker just not having found a collision at all, but just using the object you already had (ie it's 100% equivalent to the "trivial" collision of the identical file generating the same SHA1).

The question of using SHA-256 is regularly mentioned, but it has not been acted upon so far (2012).
Note: starting in 2018, with Git 2.19, the code is being refactored to use SHA-256.


Note (Humor): you can force a commit to have a particular SHA1 prefix with the gitbrute project from Brad Fitzpatrick (bradfitz).

gitbrute brute-forces a pair of author+committer timestamps such that the resulting git commit has your desired prefix.

Example: https://github.com/bradfitz/deadbeef
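
A rough sketch of a session (hypothetical; the -prefix flag name is taken from gitbrute's README, and longer prefixes take exponentially longer to brute-force):

$ cd your-repo
$ git commit -m 'some change'
$ gitbrute -prefix=dead   # varies author/committer timestamps, then amends HEAD
$ git rev-parse HEAD      # the commit hash now starts with "dead"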


Daniel Dinnyes points, in the comments, to 7.1 Git Tools - Revision Selection, which includes:

A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.


More recently (February 2017), shattered.io demonstrated the possibility of forging a SHA1 collision:
(see much more in my separate answer, including Linus Torvalds' Google+ post)

  • a/ It still requires over 9,223,372,036,854,775,808 (2^63) SHA1 computations: the equivalent processing power of 6,500 years of single-CPU computation and 110 years of single-GPU computation.
  • b/ It would forge one file (with the same SHA1), but with the additional constraint that its content and size would produce the identical SHA1: a collision on the content alone is not enough, since a blob's SHA1 is computed based on the content and the size. See "How is the git hash calculated?" and the sketch just below.
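
That header is easy to check by hand: a blob's object name is the SHA1 of the string "blob <size>\0" followed by the content. A quick sketch (ce01362… is the well-known object name of a blob containing "hello"):

$ printf 'hello\n' > demo.txt
$ git hash-object demo.txt
ce013625030ba8dba906f756967f9e9ca394464a
$ printf 'blob 6\0hello\n' | sha1sum   # same 6 bytes of content, prefixed with the header
ce013625030ba8dba906f756967f9e9ca394464a  -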

See "Lifetimes of cryptographic hash functions" from Valerie Anita Aurora for more.
In that page, she notes:

Google spent 6500 CPU years and 110 GPU years to convince everyone we need to stop using SHA-1 for security critical applications.
Also because it was cool

See more in my separate answer below.

Comedic answered 22/2, 2012 at 9:53 Comment(12)
twist: still hashes the same after adding /* This line added to avoid collision */ :D you can win the lottery twice :POrangeman
@JanusTroelsen sure, but it is still a lottery, is it not? ;) (as mention in this short note about SHA1)Comedic
@Comedic regarding that reference: is an outburst of a global werewolf epidemic - wiping out all humanity and resulting in the gruesome death of all my developers on the same night, even though they were geographically distributed - considered an unrelated incident?? Of course, assuming it happened on a full moon, obviously. Now, such a scenario would change things. Even thinking about it is insanity! That is on an entire different scale of probability! That would mean we must... STOP USING GIT! NOW!!! EVERYONE RUUUUUN!!!!!!!Smithsonite
Note that the gitbrute doesn't force a particular SHA1 but only a only a prefix (ie a subpart of the whole SHA1). Forcing an entire SHA1 (ie with a prefix of the full length of the key) will probably take "too long".Strohl
@Strohl I agree. That is why in my recent answer https://mcmap.net/q/12350/-why-does-git-use-a-cryptographic-hash-function, I do emphasis the word prefix. I have edited this answer accordingly.Comedic
link in my previous comment has been moved: that reference. Also the quote from the paragraph: A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.Smithsonite
@DanielDinnyes Good one. I have included your comment in the answer for more visibility.Comedic
@JanusTroelsen Then you would add: /* This line added to avoid collision of the avoid collision line */Gouveia
There's been further discussion amongst Linus and other devs today. He continues to think it's not a big deal, but "I'm not arguing that people shouldn't work on extending git to a new (and bigger) hash. I think that's a no-brainer, and we do want to have a path to eventually move towards SHA3-256 or whatever."Signification
@jer which means a bigger hash, requiring more computational power. That change is definitely not free...Comedic
@Comedic That's not an objection I'd have anticipated. Storing and computing the hashes should be an infinitesimal part of the storage and computation required for typical Git operations. (but it is probably going to be an expensive change, because it's replacing a part of the protocol that wasn't designed to be changed)Signification
Here's a second collision: privacylog.blogspot.com/2019/12/the-second-sha-collision.htmlEvangelineevangelism

According to Pro Git:

If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you’ll always get the data of the first object.

So it wouldn't fail, but it wouldn't save your new object either.
I don't know exactly how that would look on the command line, but it would certainly be confusing; the sketch below shows the deduplication path involved.
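
You can at least watch the deduplication path that a collision would silently ride on, using genuinely identical content (a sketch; an actual collision is of course not reproducible here). The object d670460… is the same example Pro Git itself uses:

$ git init -q demo && cd demo
$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
$ echo 'test content' | git hash-object -w --stdin   # same name: Git assumes the
d670460b4b4aece5915caf5c68d12f560a9fe3e4             # object is already written
$ find .git/objects -type f | wc -l
1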

A bit further down, that same reference attempts to illustrate the likelihood of such a collision:

Here’s an example to give you an idea of what it would take to get a SHA-1 collision. If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (1 million Git objects) and pushing it into one enormous Git repository, it would take 5 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
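
Those numbers hold up against the birthday bound: a 50% collision probability on a 160-bit hash needs roughly 1.18 * sqrt(2^160) ≈ 1.4×10^24 objects. A quick sanity check with awk:

$ awk 'BEGIN {
>   objects = 6.5e9 * 1e6 * 5 * 365.25 * 24 * 3600  # programmers * objects/s * 5 years
>   bound   = 1.1774 * sqrt(2^160)                  # ~50% birthday bound for 160 bits
>   printf "%.2e produced vs %.2e needed\n", objects, bound }'
1.03e+24 produced vs 1.42e+24 needed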

Disconsolate answered 22/2, 2012 at 9:52 Comment(9)
I'd like to see the source for the numbers in the last sentence ;-) – Concoct
@JoachimSauer Why don't you follow the link given? – Dace
@Jasper: That link is good documentation, but it does not contain statistics on the probability of every member of a team being attacked and killed by wolves in unrelated incidents on the same night. – Concoct
@JoachimSauer My bad, I thought you were talking about the five years. When talking about wolves, though, I'd like to hear you argue how the chance of getting eaten by wolves is less than one in 10^20 ;) – Dace
@Jasper: Well, the way I read it, the text literally claims that the probability of 6.5 billion team members getting killed by wolves on the same night is higher than 50%. But my main objection to the statement is that such an event would have to be a worldwide phenomenon; it's inconceivable that this could occur due to unrelated incidents. ;) – Aleksandr
@KeithRobertson I am pretty sure the post is talking about the chance of all your actual team members being eaten, compared to the chance of a hash collision if everyone in the world were producing insane amounts of code, alongside the time it takes under those circumstances to get to a 50% chance of a collision (i.e. the wolves incident didn't involve the entire world, and the 50% was separate from the wolves). You did get the point, though: if such an event is inconceivable, so should a git hash collision be. (Of course, one is (almost) purely chance-based and the other isn't, but still.) – Dace
@Jasper: Oh, yes, I see; I mistook the 6.5 billion as part of the programming team being killed. (Oh well.) However, I don't agree that, as written, the 50% is separate from the wolves. He wrote "a higher probability." Higher than what? The probability previously mentioned would be the typical implication. What probability was immediately prior? 50%. I know what the author meant, but with a direct interpretation it's not what he said, which I point out purely because it's funny (at least to me). – Aleksandr
It's frightening to see that a quote by someone who clearly hasn't understood the non-linear nature of probability is used to justify why no upgrade to safer alternatives is done (or at least provisioned for in the architecture, like switching an existing repository to SHA1 plus another hash algorithm), preferably with mathematically unrelated hash algorithms. – Potboiler
Watch out for wolves tonight. – Coly

To add to my previous answer from 2012: there is now (Feb. 2017, five years later) an example of an actual SHA-1 collision with shattered.io, where you can craft two colliding PDF files; that is, obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file.
See also "At death’s door for years, widely used SHA1 function is now dead", and this illustration.

Update, 26th of February: Linus confirmed the following points in a Google+ post:

(1) First off - the sky isn't falling. There's a big difference between using a cryptographic hash for things like security signing, and using one for generating a "content identifier" for a content-addressable system like git.

(2) Secondly, the nature of this particular SHA1 attack means that it's actually pretty easy to mitigate against, and there's already been two sets of patches posted for that mitigation.

(3) And finally, there's actually a reasonably straightforward transition to some other hash that won't break the world - or even old git repositories.

Regarding that transition, see Git 2.16 (Q1 2018), which adds a structure representing a hash algorithm. The implementation of that transition has started.

Starting with Git 2.19 (Q3 2018), Git has picked SHA-256 as NewHash and is in the process of integrating it into the code, meaning SHA-1 is still the default (as of Q2 2019, Git 2.21), but SHA-256 will be its successor.
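
(Since then, Git 2.29 (Q4 2020) and later can actually create a repository in SHA-256 mode; it is still experimental, and such repositories do not interoperate with SHA-1 ones. A quick sketch:)

$ git init --object-format=sha256 demo256 && cd demo256
$ git rev-parse --show-object-format
sha256
$ echo hello | git hash-object -w --stdin | wc -c
65   # a 64-hex-digit (256-bit) object name, plus the trailing newline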


Original answer (25th of February):

But Joey Hess tried those PDFs in a Git repo, and he found:

That includes two files with the same SHA and size, which do get different blobs thanks to the way git prepends the header to the content.

joey@darkstar:~/tmp/supercollider>sha1sum  bad.pdf good.pdf 
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  bad.pdf
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a  good.pdf
joey@darkstar:~/tmp/supercollider>git ls-tree HEAD
100644 blob ca44e9913faf08d625346205e228e2265dd12b65    bad.pdf
100644 blob 5f90b67523865ad5b1391cb4a1c010d541c816c1    good.pdf

While appending identical data to these colliding files does generate other collisions, prepending data does not.

So the main vector of attack (forging a commit) would be:

  • Generate a regular commit object;
  • use the entire commit object + NUL as the chosen prefix, and
  • use the identical-prefix collision attack to generate the colliding good/bad objects.
  • ... and this is useless because the good and bad commit objects still point to the same tree!

Plus, you can already detect cryptanalytic collision attacks against SHA-1 in any given file with cr-marcstevens/sha1collisiondetection.

Adding a similar check in Git itself would have some computation cost.
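
A sketch of what that check looks like with the standalone tool (the sha1dcsum output line below is the one shown in the project's README; the shattered PDFs ship in its test directory):

$ git clone https://github.com/cr-marcstevens/sha1collisiondetection
$ cd sha1collisiondetection && make
$ bin/sha1dcsum test/shattered-1.pdf   # *coll* marks a detected collision attack
16e96b70000dd1e7c85b8368ee197754400e58ec *coll* test/shattered-1.pdf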

On changing the hash, Linus comments:

The size of the hash and the choice of the hash algorithm are independent issues.
What you'd probably do is switch to a 256-bit hash, use that internally and in the native git database, and then by default only show the hash as a 40-character hex string (kind of like how we already abbreviate things in many situations).
That way tools around git don't even see the change unless passed in some special "--full-hash" argument (or "--abbrev=64" or whatever - the default being that we abbreviate to 40).

Still, a transition plan (from SHA1 to another hash function) would be complex, but it is actively studied.
A convert-to-object_id campaign is in progress.


Update, 20th of March: GitHub details a possible attack and its protection:

SHA-1 names can be assigned trust through various mechanisms. For instance, Git allows you to cryptographically sign a commit or tag. Doing so signs only the commit or tag object itself, which in turn points to other objects containing the actual file data by using their SHA-1 names. A collision in those objects could produce a signature which appears valid, but which points to different data than the signer intended. In such an attack the signer only sees one half of the collision, and the victim sees the other half.

Protection:

The recent attack uses special techniques to exploit weaknesses in the SHA-1 algorithm that find a collision in much less time. These techniques leave a pattern in the bytes which can be detected when computing the SHA-1 of either half of a colliding pair.

GitHub.com now performs this detection for each SHA-1 it computes, and aborts the operation if there is evidence that the object is half of a colliding pair. That prevents attackers from using GitHub to convince a project to accept the "innocent" half of their collision, as well as preventing them from hosting the malicious half.

See "sha1collisiondetection" by Marc Stevens


Again, with Git 2.16 (Q1 2018) adding a structure representing a hash algorithm, the implementation of a transition to a new hash has started.
As mentioned above, the newly supported hash will be SHA-256.

Comedic answered 25/2, 2017 at 0:5 Comment(7)
Regarding the collision: 1. The attempt was to create a collision, not one occurring by coincidence. 2. From the PDF report: "In total the computational effort spent is equivalent to 2^63.1 SHA-1 compressions and took approximately 6,500 CPU years and 100 GPU years." 3. Although we should move on from MD5 and SHA-1, they are in general fine for file-uniqueness uses. – Drillmaster
It's worth noting that WebKit checked in the colliding PDFs for a test. It broke their git-svn mirror infrastructure: bugs.webkit.org/show_bug.cgi?id=168774#c24 – Letta
@Letta It is worth noting indeed... in that I noted it in the answer (the link behind "It does have some issue for git-svn though" refers to it, albeit indirectly). – Comedic
Well, is the point raised in the original answer addressed? "I suppose it would be better if it failed with an error at least in all cases." Does it now fail with an error? I fully understand how improbable it is, but correctness is correctness. – Crossstaff
@Crossstaff No, it does not yet fail with an error. A big patch is in progress which will then help facilitate that collision detection: marc.info/?l=git&m=148987267504882&w=2 – Comedic
@Crossstaff See edit 4 in stackoverflow.com/posts/42450327/revisions: it does fail now, at least when pushed to GitHub. – Comedic
I would add that 100 GPU years to attack a particular project via submodules is not that expensive. It may be "prohibitive" in general, but not all projects are created equal: if it allows installing a backdoor in a crypto library used by wallets and other cryptocurrency projects, the ROI is there. 100 GPU years costs less than $500,000 on GCP, for example. – Bombard

I think cryptographers would celebrate.

Quote from the Wikipedia article on SHA-1:

In February 2005, an attack by Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu was announced. The attacks can find collisions in the full version of SHA-1, requiring fewer than 2^69 operations. (A brute-force search would require 2^80 operations.)

Conservatoire answered 22/2, 2012 at 10:37 Comment(4)
The point is that a flaw had been found in SHA1 around the time Git was being introduced. Also, probability is non-linear. Just because you play the lottery for fifty years does not mean you have a higher chance of winning. You just have the same chance every single time. The person playing for the first time can still win. – Potboiler
This is only an attack that finds collisions, meaning you can find x and y such that h(x) == h(y). That is a serious threat for arbitrary data like SSL certificates; however, it does not affect Git, which would only be vulnerable to a second pre-image attack, i.e. given a message x, you can modify it into a message x' such that h(x) == h(x'). So this attack doesn't weaken Git. Also, Git didn't choose SHA-1 for security reasons. – Anadiplosis
Now a collision has been found; just not one that bothers git directly yet. stackoverflow.com/questions/42433126/… – Conservatoire
2^69 is about 600 exa-operations. Eight years later, Nvidia's SaturnV supercomputer, upgraded with their A100s, can do 4.6 ExaOPS, so it could potentially solve this in a bit over 2 minutes, or do a brute-force attack in a few days. – Nippon

There are several different attack models for hashes like SHA-1, but the one usually discussed is collision search, as implemented for instance in Marc Stevens' HashClash tool.

"As of 2012, the most efficient attack against SHA-1 is considered to be the one by Marc Stevens[34] with an estimated cost of $2.77M to break a single hash value by renting CPU power from cloud servers."

As folks pointed out, you could force a hash collision with Git, but doing so won't overwrite existing objects in another repository. I'd imagine even git push -f --no-thin won't overwrite existing objects, but I'm not 100% sure.

That said, if you hacked into a remote repository, you could make your false object the older one there, possibly embedding hacked code into an open-source project on GitHub or similar. If you were careful, then maybe you could introduce a hacked version that new users would download.

I suspect, however, that many things the project's developers might do could either expose or accidentally destroy your multi-million-dollar hack. In particular, that's a lot of money down the drain if some developer you didn't hack ever runs the aforementioned git push --no-thin after modifying the affected files, sometimes even without the --no-thin, depending on the circumstances.

Maxima answered 19/12, 2014 at 11:28 Comment(0)
