How much of a git sha is *generally* considered necessary to uniquely identify a change in a given codebase?

304

If you're going to build, say, a directory structure where a directory is named for a commit in a Git repository, and you want it to be short enough to make your eyes not bleed, but long enough that the chance of it colliding would be negligible, how much of the SHA substring is generally required?

Let's say I want to uniquely identify this change: https://github.com/wycats/handlebars.js/commit/e62999f9ece7d9218b9768a908f8df9c11d7e920

I can use as little as the first four characters: https://github.com/wycats/handlebars.js/commit/e629

But I feel like that would be risky. But assuming a codebase that, over a couple of years, might have—say—30k changes, what are the chances of collision if I use 8 characters? 12? Is there a number that's generally considered acceptable for this sort of thing?

Lumber answered 8/8, 2013 at 19:37 Comment(1)
Related: #32406422 - Ego
328

This question is actually answered in Chapter 7 of the Pro Git book:

Generally, eight to ten characters are more than enough to be unique within a project. One of the largest Git projects, the Linux kernel, is beginning to need 12 characters out of the possible 40 to stay unique.

7 digits are the Git default for a short SHA, so that's fine for most projects. The kernel team has increased theirs several times, as mentioned, because they have several hundred thousand commits. So for your ~30k commits, 8 or 10 digits should be perfectly fine.
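As a rough cross-check of that recommendation, here is a minimal sketch that searches for the shortest prefix length whose collision probability stays under a chosen threshold. It assumes hash prefixes behave like uniformly random hex strings and uses the birthday approximation p ~= n^2 / (2 * 16^c); the function name and the thresholds are illustrative, not anything Git provides. Git abbreviations have to be unique across all objects, not just commits, so if you know the object count, use that instead.

```python
def min_hex_chars(num_items, max_probability):
    """Smallest prefix length c with n^2 / (2 * 16^c) <= max_probability.

    Birthday approximation; assumes uniformly distributed hash prefixes.
    """
    for length in range(1, 41):  # a full SHA-1 has 40 hex characters
        if num_items ** 2 / (2 * 16 ** length) <= max_probability:
            return length
    return 40


# Illustrative thresholds for a repository with about 30k commits.
for threshold in (1e-2, 1e-3, 1e-6):
    print(threshold, min_hex_chars(30_000, threshold))
```

Under these assumptions, 30k items need 9 characters for a 1% collision budget, 10 for 0.1%, and 13 for one in a million.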

Esurient answered 8/8, 2013 at 19:54 Comment(4)
Also note that git is fairly smart when it comes to this. You can set the abbreviation short, say to 4, and git will use 4 digits for as many hashes as it can, but switch to 5 or more when it knows that the abbreviation is not unique... - Clamshell
Note also, though, that this of course only applies at the moment Git prints the SHA. If you "save" abbreviated SHAs (say, in logs, emails, IMs, etc.) and use them later to refer to commits, they might no longer be unique! While certainly unlikely for normal lengths like 7-12 characters, if you do go down to 4 or 5, and you get a few tens of thousands of new objects (or commits, depending on context), this might indeed come back to bite you. - Esurient
"7 digits are the default for a short SHA". While this is true in a sense, it may give people the impression that git will always use a 7-digit abbreviation unless told to do otherwise. Instead, git (now) dynamically computes what a short SHA is based on the number of objects in the repo (https://mcmap.net/q/56564/-how-much-of-a-git-sha-is-generally-considered-necessary-to-uniquely-identify-a-change-in-a-given-codebase). The 7-digit default assumption (mea culpa) has led to bugs in my programs. - Aldenalder
@NevikRehnel Is that a problem in practice? If you save an abbreviated commit in an email or DM or something, you have two pieces of information: the abbreviated commit and a timestamp. Then you know to look for a commit which is older than that timestamp. Commits that came after it are irrelevant. That might be a more costly query than (presumably) just checking SHA1s without inspecting the objects, but at least you don't lose any information. - Olenta
208

Note: you can ask git rev-parse --short for the shortest and yet unique SHA1.
See "git get short hash from regular hash"

git rev-parse --short=4 921103db8259eb9de72f42db8b939895f5651489
92110

As you can see in my example the SHA1 has a length of 5 even if I specified a length of 4.
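The behaviour shown above (start at the requested length, then lengthen until the prefix is unambiguous) can be sketched as follows. This is only an illustration of the idea, not how Git implements it: Git checks against every object in its object database, while this sketch scans a plain list. The function name and the second and third hashes are made up for the example.

```python
def shortest_unique_prefix(target, all_hashes, min_len=4):
    """Return the shortest prefix of `target`, at least `min_len` long,
    that no other hash in `all_hashes` starts with."""
    others = [h for h in all_hashes if h != target]
    for length in range(min_len, len(target) + 1):
        prefix = target[:length]
        if not any(h.startswith(prefix) for h in others):
            return prefix
    return target  # no shorter unique prefix exists


hashes = [
    "921103db8259eb9de72f42db8b939895f5651489",  # from the example above
    "9211a7c0f3e54d1b8a6c2e9d7f0b1a2c3d4e5f60",  # hypothetical, shares "9211"
    "4f2c9e1d0b3a5768c9d8e7f6a5b4c3d2e1f00112",  # hypothetical
]
print(shortest_unique_prefix(hashes[0], hashes, min_len=4))  # prints 92110
```

Because a second object shares the first four characters, the requested length of 4 is bumped to 5, just as in the git rev-parse output.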


For big repos, 7 has not been enough since 2010; see commit dce9648 by Linus Torvalds himself (Git 1.7.4.4, Oct 2010):

The default of 7 comes from fairly early in git development, when seven hex digits was a lot (it covers about 250+ million hash values).
Back then I thought that 65k revisions was a lot (it was what we were about to hit in BK), and each revision tends to be about 5-10 new objects or so, so a million objects was a big number.

(BK = BitKeeper)

These days, the kernel isn't even the largest git project, and even the kernel has about 220k revisions (much bigger than the BK tree ever was) and we are approaching two million objects.
At that point, seven hex digits is still unique for a lot of them, but when we're talking about just two orders of magnitude difference between number of objects and the hash size, there will be collisions in truncated hash values.
It's no longer even close to unrealistic - it happens all the time.

We should both increase the default abbrev that was unrealistically small, and add a way for people to set their own default per-project in the git config file.

core.abbrev

Set the length object names are abbreviated to.
If unspecified, many commands abbreviate to 7 hexdigits, which may not be enough for abbreviated object names to stay unique for sufficiently long time.

environment.c:

int minimum_abbrev = 4, default_abbrev = 7;

Note: As commented below by marco.m, core.abbrevLength was renamed to core.abbrev in that same Git 1.7.4.4, in commit a71f09f

Rename core.abbrevlength back to core.abbrev

It corresponds to --abbrev=$n command line option after all.
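For a rough sense of scale behind the 2010 commit message quoted above, the sketch below estimates how many pairs of objects would be expected to share an abbreviated name. It assumes object names behave like uniformly random hex strings, which is an assumption on my part; the function name is mine, not Git's.

```python
def expected_colliding_pairs(num_objects, hex_digits):
    """Expected number of object pairs sharing the same abbreviated name,
    assuming uniformly random names: C(n, 2) / 16^digits."""
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / 16 ** hex_digits


# About two million objects, as in the quoted commit message.
for digits in range(7, 13):
    print(digits, expected_colliding_pairs(2_000_000, digits))
```

With two million objects and 7 hex digits this comes out at several thousand colliding pairs, consistent with "it happens all the time"; each extra digit divides the figure by 16.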


More recently, Linus added in commit e6c587c (for Git 2.11, Q4 2016):
(as mentioned in Matthieu Moy's answer)

In fairly early days we somehow decided to abbreviate object names down to 7-hexdigits, but as projects grow, it is becoming more and more likely to see such a short object names made in earlier days and recorded in the log messages no longer unique.

Currently the Linux kernel project needs 11 to 12 hexdigits, while Git itself needs 10 hexdigits to uniquely identify the objects they have, while many smaller projects may still be fine with the original 7-hexdigit default. One-size does not fit all projects.

Introduce a mechanism, where we estimate the number of objects in the repository upon the first request to abbreviate an object name with the default setting and come up with a sane default for the repository. Based on the expectation that we would see collision in a repository with 2^(2N) objects when using object names shortened to first N bits, use sufficient number of hexdigits to cover the number of objects in the repository.
Each hexdigit (4-bits) we add to the shortened name allows us to have four times (2-bits) as many objects in the repository.

See commit e6c587c (01 Oct 2016) by Linus Torvalds (torvalds).
See commit 7b5b772, commit 65acfea (01 Oct 2016) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit bb188d0, 03 Oct 2016)
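The estimate described in that commit message can be sketched roughly as follows. This is my reading of the quoted text, not a copy of Git's implementation: take the number of bits needed to write the object count, use half as many hex digits (rounded up), and never go below the old default of 7. The function name and the `minimum` parameter are assumptions for illustration.

```python
def estimated_abbrev_length(object_count, minimum=7):
    """Estimate a default abbreviation length from the object count.

    With roughly 2^b objects, about 2*b bits of prefix are wanted to keep
    collisions unlikely; at 4 bits per hex digit that is b/2 digits.
    """
    bits = max(object_count, 1).bit_length()  # b: bits needed for the count
    hex_digits = (bits + 1) // 2              # b / 2, rounded up
    return max(hex_digits, minimum)


for count in (1_000, 100_000, 2_000_000, 5_000_000):
    print(count, estimated_abbrev_length(count))
```

Quadrupling the object count adds two bits to the count and therefore one hex digit, which is the "four times as many objects" relationship stated in the commit message.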

That new property (guessing a reasonable default for the SHA1 abbrev value) has a direct effect on how Git computes its own version number for release.

Terraqueous answered 9/1, 2014 at 8:25 Comment(8)
This answer provides a way to check what the longest "shortened" hash in a single repository is: https://mcmap.net/q/57065/-in-my-repo-how-long-must-the-longest-hash-prefix-be-to-prevent-any-overlap - Intermediacy
Note that core.abbrevLength has been renamed to core.abbrev. - Faucet
@Faucet Thank you. I have amended the answer accordingly. And I have linked to the Git commit which records that new name for core.abbrev. - Terraqueous
I'll just add to this that you can run git rev-parse --short=10 --verify HEAD to generate 10 characters. We WERE using git log -1 --format=%h, but that only generated 7 characters and we got a collision. - Espalier
Thanks for the explanation, the docs (git-scm.com/docs/git-rev-parse) are stale. - Satrap
@AndréWerlang do you mean the link is no longer valid? Or its content not up-to-date? - Terraqueous
@Terraqueous content is not up-to-date. I sent a patch. - Satrap
@AndréWerlang OK. I don't see it yet in spinics.net/lists/git - Terraqueous
66

This is known as the birthday problem.

For probabilities less than 1/2 the probability of a collision can be approximated as

$$p \approx \frac{n^2}{2m}$$

Where $n$ is the number of items and $m$ is the number of possibilities for each item.

The number of possibilities for a hex string is $16^c$, where $c$ is the number of characters.

So for 8 characters and 30K commits:

$$30\text{K} \approx 2^{15}$$

$$p \approx \frac{n^2}{2m} \approx \frac{(2^{15})^2}{2 \cdot 16^{8}} = \frac{2^{30}}{2^{33}} = \frac{1}{8}$$

Increasing it to 12 characters:

$$p \approx \frac{n^2}{2m} \approx \frac{(2^{15})^2}{2 \cdot 16^{12}} = \frac{2^{30}}{2^{49}} = 2^{-19}$$
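The arithmetic above can be checked numerically with a short sketch. Alongside the simple approximation it also prints the slightly tighter standard estimate 1 - exp(-n(n-1)/(2m)), which I am adding for comparison; it is not part of the derivation above. The function name is illustrative.

```python
import math


def collision_probability(num_items, hex_chars):
    """Return (simple, tighter) birthday estimates of a prefix collision.

    simple  = n^2 / (2m)
    tighter = 1 - exp(-n(n-1) / (2m)),  with m = 16^hex_chars
    """
    m = 16 ** hex_chars
    simple = num_items ** 2 / (2 * m)
    tighter = -math.expm1(-num_items * (num_items - 1) / (2 * m))
    return simple, tighter


for chars in (8, 12):
    print(chars, collision_probability(2 ** 15, chars))
```

For 8 characters the simple estimate is exactly 1/8, while the tighter one is a little lower (just under 12%); for 12 characters the two agree to many decimal places.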

Alveta answered 2/3, 2017 at 22:36 Comment(2)
Exactly the question I was trying to solve, thank you! The probability table linked in @Messa's answer is also helpful. - Distil
Excellent. We need more answers like this, which explain not only what it is but also how it comes about... - Bridwell
16

This question has been answered, but for anyone looking for the math behind it: it's called the birthday problem (Wikipedia).

It is about the probability that 2 (or more) people in a group of N people share a birthday. This is analogous to the probability that 2 (or more) Git commits, in a repository with N commits in total, share the same hash prefix of length X.

Look at the Probability table. For example, for a hex hash string of length 8, the probability of a collision reaches 1% when the repository has only about 9,300 items (Git commits). For 110,000 commits the probability is 75%. But with a hex hash string of length 12, the probability of a collision in 100,000 commits is below 0.1%.
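The figures quoted from the table can be reproduced approximately by inverting the birthday estimate p ~= 1 - exp(-n^2 / (2m)), which gives n ~= sqrt(2m * ln(1 / (1 - p))). A minimal sketch, assuming uniformly distributed prefixes; the function name is mine and is not taken from the Wikipedia article.

```python
import math


def items_for_probability(probability, hex_chars):
    """Approximate number of items at which the chance of at least one
    prefix collision reaches `probability`, for prefixes of `hex_chars`."""
    m = 16 ** hex_chars
    return math.sqrt(2 * m * math.log(1 / (1 - probability)))


for chars in (8, 12):
    for p in (0.01, 0.75):
        print(chars, p, round(items_for_probability(p, chars)))
```

For 8 characters this gives roughly 9,300 items at 1% and roughly 109,000 at 75%, in line with the numbers above.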

Haubergeon answered 20/8, 2016 at 20:13 Comment(0)
2

Git version 2.11 (or perhaps 2.12?) will contain a feature that adapts the number of characters used in short identifiers (e.g. git log --oneline) to the size of the project. Once you use such a version of Git, the answer to your question can be "pick whatever length Git gives you with git log --oneline, it's safe enough".

For more details, see Changing the default for “core.abbrev”? discussion in Git Rev News edition 20 and commit bb188d00f7.

Bradawl answered 24/10, 2016 at 14:6 Comment(1)
The problem with this approach is that big projects often start off as small ones. - Alveta
