In my repo, how long must the longest hash prefix be to prevent any overlap?

Asked 4/9, 2015 at 20:14 Answered 24/5, 2016 at 1:49

Solved git linux-kernel sha1 git-log git-rev-list

The --abbrev-commit flag can be used in conjunction with git log and git rev-list in order to show partial prefixes instead of the full 40-character SHA-1 hashes of commit objects. According to the Pro Git book,

it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous [...]

Additionally, short SHAs are at least 4-character long. Still according to the Pro Git book,

Generally, eight to ten characters are more than enough to be unique within a project.

As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.

Since the length of the longest prefix required to prevent any overlap among all prefix hashes of commit objects (11, in the case of the Linux kernel) is a crude indicator of a repo's size, I'd like to programmatically determine the corresponding quantity in my own local repository. How can I do that?

Plunder answered 4/9, 2015 at 20:14 Comment(6)

Do you mean https://mcmap.net/q/56564/-how-much-of-a-git-sha-is-generally-considered-necessary-to-uniquely-identify-a-change-in-a-given-codebase? – Withrow 4/9, 2015 at 20:23

@ArkadiuszDrabczyk Not exactly. Your link only gives a method for determining how short the prefix of a given commit hash can be in order to avoid overlap with some other hash. I'm asking for the maximum of that quantity over all commit hashes in the repository. – Plunder 4/9, 2015 at 20:26

You can't rigorously determine the required prefix length without examining all the commits in a repo. In principle, a repo could have just two commits that are identical in their first 39 characters. And the actual required length could change with the next commit. – Stutter 4/9, 2015 at 21:30

@KeithThompson I know. The approach outlined in my answer does examine all commits. – Plunder 4/9, 2015 at 22:10

The default that the Linux kernel uses (the biggest user of git, for obvious reasons -- they invented git so they could version control the kernel) is 12 characters. It's unlikely that you'll have a collision with 12 characters, but you can always exhaustively check what the absolute minimum is for a set of commits. – Higinbotham 5/9, 2015 at 13:49

@Higinbotham [...] you can always exhaustively check what the absolute minimum is for a set of commits. That's the idea. Check out my answer. – Plunder 15/1, 2016 at 15:11

The following shell script, when run in a local repo, prints the length of the longest prefix required to prevent any overlap among all prefix hashes of commit objects of that repository.

MAX_LENGTH=4;

git rev-list --abbrev=4 --abbrev-commit --all | \
  ( while read -r line; do
      if [ ${#line} -gt $MAX_LENGTH ]; then
        MAX_LENGTH=${#line};
      fi
    done && printf %s\\n "$MAX_LENGTH"
  )

The last time I edited this answer, the script printed

"9" when run in a clone of the Git-project repo,
"9" when run in a clone of the OpenStack repo,
"11" when run in a clone of the Linux-kernel repo.

Plunder answered 4/9, 2015 at 20:28 Comment(0)

Jubob's script is great, upvoted.

If you want to get an idea of the distribution of minimum-commit-hash-length, you can run this one-liner:

git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c

For the git project itself today (git-on-git), this yields something like:

... yielding 1788 commits that can be represented uniquely with a 4-char hash (or lower, this is Git's minimum abbrev), and 4 commits which require 9-of-40 characters of the hash in-order to uniquely select them.

By comparison, a much larger project such as the Linux kernel, has this distribution today:

So with a database of nearly 5 million objects and 600k commits, there's 3 commits currently requiring 11 of 40 hexadecimal digits to distinguish them from all other commits.

Nosy answered 24/5, 2016 at 1:49 Comment(2)

Nice. For a repo I tried this in I got a line like this: 9 1. That is, there is one item that requires hash of length 9. Not 2 or more, but 1. So if I abbreviate that manually to length 7, I don't get a conflict when checking that out etc. Is this because of some other internals/objects with a sha that does not represent a commit? – Leveloff 8/12, 2022 at 14:22

@johan-lundberg well observed, and the answer can probably be improved to point out that there is an asymmetry here in what Git offers with these abbreviations - my git rev-list doesn't include the --objects option ... if it did, you'd see some (unabbreviated) sha1s that belong to blobs and trees. When Git abbreviates commit sha1s, it won't allow ambiguity with these blob/tree hashes either ... so your length-9 commit sha1, despite being unique for commits, when shortened will collide with an object - try git show <7 or 8 length sha1>. – Nosy 15/12, 2022 at 6:32

Recommended topics

Hot tags