Why Isn't There A Git Clone Specific Commit Option?

Asked 1/10, 2014 at 6:18 Answered 1/10, 2014 at 7:3

In light of a recent question on SO, I am wondering why there isn't an option in git clone such that the HEAD pointer of the newly created branch will point to a specified commit. In that question, the OP is trying to give his users instructions on which specific commit they should clone.

Note that this question is not about how to clone to a particular version using reset, but about why there isn't such an option.

Bushwhacker answered 1/10, 2014 at 6:18 Comment(0)

Two answers so far (at the time I wrote this, now there are more) are correct in what they say, but don't really answer the "why" question. Of course, the "why" question is really hard to answer, except by the authors of the various bits of Git (and even then, what if two frequent Git contributors gave two different answers?).

Still, considering Git's "philosophy" as it were, in general, the various transfer protocols work by naming a reference. If they provide an SHA-1, it's the SHA-1 of that reference. For someone who does not already have direct (e.g., command-line) access to the repository, none¹ of the built in commands allow one to refer to commits by ID. The closest thing I can find to a reason for this—and it is actually a good reason²—is this bit in the git upload-archive documentation:

SECURITY

In order to protect the privacy of objects that have been removed from history but may not yet have been pruned, git-upload-archive avoids serving archives for commits and trees that are not reachable from the repository's refs. However, because calculating object reachability is computationally expensive, git-upload-archive implements a stricter but easier-to-check set of rules ...

However, it goes on to say:

If the config option uploadArchive.allowUnreachable is true, these rules are ignored, and clients may use arbitrary sha1 expressions. This is useful if you do not care about the privacy of unreachable objects, or if your object database is already publicly available for access via non-smart-http.

which is particularly interesting since git clone gets all reachable objects in the first place, after which your local clone could trivially check out a commit by SHA-1 ID (and create a local branch name pointing to that ID if desired, or just leave your clone in "detached HEAD" mode).
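To make the "check out by ID after cloning" point concrete, here is a minimal sketch. It assumes a throwaway local repository standing in for the remote; the paths, the commit messages and the branch name pinned are all made up for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch area: "src" plays the role of the remote repository.
work=$(mktemp -d)
src="$work/src"
dst="$work/dst"

# Helper (illustrative only): make an empty commit in src with a fixed identity.
make_commit() {
    git -C "$src" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
make_commit "first"
old_sha=$(git -C "$src" rev-parse HEAD)   # the commit we want to end up on
make_commit "second"

# An ordinary clone brings over every reachable object, old_sha included.
git clone -q "$src" "$dst"

# Option 1: leave the clone in "detached HEAD" mode at that commit.
git -C "$dst" checkout -q --detach "$old_sha"
detached_head=$(git -C "$dst" rev-parse HEAD)

# Option 2: create a local branch name pointing to that ID and switch to it.
git -C "$dst" checkout -q -b pinned "$old_sha"
pinned_branch=$(git -C "$dst" symbolic-ref --short HEAD)

echo "detached at: $detached_head"
echo "now on branch: $pinned_branch"
```

Either way the work happens entirely in the local clone, after the transfer is done; nothing here asks the server for a commit by ID.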

Given these two cross-currents, I think the real answer to "why", at this point, is "nobody cares enough to add it". :-) The privacy argument is valid, but there is no reason that git clone could not check out a commit by ID after cloning, just as it can be told to check out some branch other than master³ with git clone -b .... The only drawback to allowing -b sha1 is that Git cannot check up front (before the cloning process begins) whether sha1 will be received. It can check reference names, since those are transferred (along with their branch tips or other SHA-1 values) up front, so git clone -b nonexistentbranch ssh://... terminates quickly and does not create the copy:

fatal: Remote branch nonexistentbranch not found in upstream origin
fatal: The remote end hung up unexpectedly

If -b allowed an ID, you'd get the whole clone, then it would have to tell you: "oh gosh, sorry, can't check out that ID, I'll leave you on master instead" or whatever. (Which is more or less what happens now with a busted submodule.)
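The up-front check on reference names can be tried locally. This is only a sketch, assuming a scratch repository in place of the ssh:// remote; the directory names are hypothetical, and the exact wording of the error may vary between Git versions.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch area: "src" stands in for the remote.
work=$(mktemp -d)
src="$work/src"
dst="$work/dst"

git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "only commit"

# Ask for a branch the source does not have. Git sees the list of refs
# before copying anything, so it can refuse right away.
if git clone -q -b nonexistentbranch "$src" "$dst" 2>"$work/err.txt"; then
    clone_status=0
else
    clone_status=$?
fi

cat "$work/err.txt"

# The failed clone should not leave a copy behind.
if [ -e "$dst" ]; then left_behind=yes; else left_behind=no; fi
echo "exit status: $clone_status, directory left behind: $left_behind"
```

The error text should resemble the one quoted above. With a commit ID there is no such list to consult first, which is the drawback described here.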

¹While git upload-archive now enforces this "privacy" rule, this was not always the case (it was introduced in version 1.7.8.1); and many (most?) git-web servers, including the one distributed with Git itself, allow viewing by arbitrary ID. This is probably why allowUnreachable was added to upload-archive a few years after the "only by ref name" code was added (but note that releases of Git after 1.7.8 and before 2.0.0 have no way to loosen the rules). Hence, while the "security" idea is valid, there was a period (pre 1.7.8.1) when it was not enforced.

²There are numerous ways to "leak" ostensibly private data out of a Git repository. A new file, Documentation/transfer-data-leaks, is about to appear in Git 2.11.1, while Git 2.11.0 added some internal features (see commit 722ff7f87 among others) to immediately drop objects pushed but not accepted. Such objects are eventually garbage-collected, but that leaves them exposed for the duration.

³Actually, by default git clone makes a local check-out of the branch it thinks goes with the remote's HEAD reference. Usually that's master anyway, though.

Revert answered 1/10, 2014 at 7:3 Comment(7)

Would github.com/git/git/blob/… be an appropriate link? – Sailmaker 1/10, 2014 at 7:52

That allowUnreachable business seems fairly recent (Git 2.0+): github.com/git/git/commit/… – Sailmaker 1/10, 2014 at 7:56

And (following a link embedded in that link) the restrictions against arbitrary SHA-1 were introduced in 1.7.8.1. – Revert 1/10, 2014 at 8:1

So... 2011-12-21: three years ago. – Sailmaker 1/10, 2014 at 8:3

@VonC: footnote 1 edited to include versions, and tweak wording about dates. (I still think of everything after git 1.6 as "kind of new", I remember being stuck with 1.5.x on some corporate systems...) – Revert 1/10, 2014 at 8:7

The security section you quote is from git-upload-archive, which creates a tar/zip archive, not git-upload-pack that is part of the protocol. While you can't (directly) get a commit by ID from the server, a rogue client could send a branch update to the server, even pointing to an unreachable commit and the client need not deliver a commit since it already exists. Which is a long way to say that the protocol handlers do not follow the quoted security section. – Sanctified 1/10, 2014 at 20:32

@EdwardThomson: yes, that's why it's all noted as indirect, and the "security" aspect is incomplete at best. I think it qualifies as a philosophical reason, though. – Revert 1/10, 2014 at 20:55

Cloning a repo is a different operation than checkout. You don't "clone a specific commit". For convenience you can clone and then checkout a particular pre-existing branch at the same time, since that is what most people want. If that doesn't meet your needs (no branch for the particular SHA you want) just use or alias some form of

git clone -n <some repo> <some dir> && cd <some dir> && git checkout <SHA>

Corot answered 1/10, 2014 at 6:31 Comment(0)

If your specific commit is referenced by a branch, you can do a:

git clone -b yourBranch /url/of/the/repo

The cloned repo will be directly at the commit referenced by that branch.
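A small sketch of this, assuming you control the repository being cloned and can name the commit there first. The scratch paths, the branch name clone_me and the tag name v-old are placeholders. As far as I know, git clone -b also accepts a tag, in which case the clone ends up with a detached HEAD at that commit.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch area: "src" stands in for /url/of/the/repo.
work=$(mktemp -d)
src="$work/src"

# Helper (illustrative only): make an empty commit in src with a fixed identity.
make_commit() {
    git -C "$src" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

git init -q "$src"
git -C "$src" symbolic-ref HEAD refs/heads/main
make_commit "old"
old_sha=$(git -C "$src" rev-parse HEAD)
make_commit "new"

# On the source side, give the older commit a name (a branch and a tag).
git -C "$src" branch clone_me "$old_sha"
git -C "$src" tag v-old "$old_sha"

# Clone by branch name: HEAD is on clone_me, at the older commit.
git clone -q -b clone_me "$src" "$work/by-branch"
branch_head=$(git -C "$work/by-branch" rev-parse HEAD)

# Clone by tag name: HEAD is detached at the tagged commit.
git clone -q -b v-old "$src" "$work/by-tag" 2>/dev/null
tag_head=$(git -C "$work/by-tag" rev-parse HEAD)

echo "wanted:    $old_sha"
echo "by branch: $branch_head"
echo "by tag:    $tag_head"
```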

Sailmaker answered 1/10, 2014 at 6:23 Comment(4)

Right. In my case, the submodules we want to work with are associated to specific (older) commits in master. – Bushwhacker 1/10, 2014 at 6:32

you can just create a branch off the older commits at any time - git branch clone_me OLD_SHA – Corot 1/10, 2014 at 6:37

Another solution suggested elsewhere is to fork, then create a tag pointing to the SHA you want, then use git submodules to track the tagged SHA. – Northeaster 1/4, 2016 at 16:29

@Northeaster I agree. I suspect it is one simple solution. – Sailmaker 1/4, 2016 at 19:49

As the other answers say, this is typically not much of an issue, but they don't say why you can't clone a specific commit. The answer is security.

If you accidentally push confidential information and then force-push a fixed history, the commits containing the confidential information remain stored on the server until the server's Git garbage collector determines that they are no longer needed. If the hash is known (it might, for example, be available in logs), a malicious user could request the specific commit that shouldn't have been pushed, even if you had verified that nobody had fetched those commits by the time you force-pushed the fixed history.

Making sure you can only clone from refs makes sure that only "reachable" commits will be sent to the clients.

Natica answered 1/10, 2014 at 6:47 Comment(4)

What would stop a malicious user from pushing up a new branch that pointed to that commit ID? (The client can merrily create a new branch that points to a commit that the server already has without having to push up any other commits.) The malicious user could then clone and get the confidential commit. – Sanctified 1/10, 2014 at 20:13

@EdwardThomson Presumably a malicious user wouldn't have push access, but fair point. – Natica 1/10, 2014 at 21:0

I presume that the idea is that the clone operation will filter out those commits that shouldn't be accessible? If so, why wouldn't it be possible to do that on the server instead of wasting bandwidth and precious time (as well as disk space), by sending the desired commit to the server? I'm currently watching the progress indicator of a linux kernel clone creeping forward with despair, knowing that the commit I want is way back in the past... – Telegraph 23/6, 2015 at 18:41

Security isn't a big point in git's design (otherwise they'd have made replacing SHA1 with a hash from the SHA2 family a priority). I.e. security can't be the main reason (it could be an auxiliary reason if two otherwise equivalent design alternatives were there, but having and not having a by-hash download aren't otherwise-equivalent alternatives). – Marinara 13/9, 2020 at 6:45
