How can I uniquely identify a git repository

I would like to create a tool that checks whether I already have a local clone of a remote repository before cloning it. To do this, I need a way of testing whether a local repository B is the same as the remote repository A, by which I mean that they have mergeable histories. B might be named differently than A and might have additional branches: the usual use cases.

Is there a way to do this? I have a tentative idea how to do it, but I thought perhaps someone here has a definitive answer.

Tentative idea

Get a list of branches and search for common branches (by hash). Then, for the common branches, check that the initial commits are the same (by hash). At that point I would say 'good enough'. I figure I'm okay unless someone has been messing with history, a use case I'm willing to ignore. To do this, though, I need a way of getting the branch and commit information from the remote repository without cloning it. I can solve this using ssh and bash, but a git-only solution would be preferable.
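For the "without doing a clone" part, git ls-remote lists a remote's branch heads and their hashes without fetching any objects. A minimal sketch follows, assuming the local repository has fetched at least one of the remote's current branch tips; the helper name shares_branch_tip is hypothetical, not part of git.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the local repository ($1) already contains
# at least one branch tip advertised by the remote ($2). Nothing is cloned.
shares_branch_tip() {
    local_repo=$1
    remote_url=$2
    # ls-remote prints one "<sha><TAB>refs/heads/<branch>" line per remote branch.
    git ls-remote --heads "$remote_url" |
    while read -r sha ref; do
        # cat-file -e exits 0 only if the object exists in the local repository.
        if git -C "$local_repo" cat-file -e "${sha}^{commit}" 2>/dev/null; then
            echo "$ref $sha"
        fi
    done | grep -q .   # exit status 0 if at least one tip was found locally
}

# Usage (paths and URL are placeholders):
#   shares_branch_tip ~/src/myclone https://example.com/some/repo.git && echo "already cloned"
```

A negative result is not conclusive: if every remote branch has moved on since the last fetch, none of the advertised tips exist locally even though the histories are related.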

Feedback on the half-baked idea is also welcome.

Why this is not a duplicate of Git repository unique id

The referenced question is looking for a unique repository id, or a way of creating one. No such beast exists, and even if it did, it is questionable whether it would be relevant here, since I want to determine whether two repositories have mergeable histories (i.e. I could fetch and merge between the two), which is a slightly better defined problem. I'm willing to ignore the possibility that a user has modified history, but would love to hear how to handle that case as well.

Dicentra answered 19/1, 2016 at 10:27 Comment(4)
Please do post your "tentative idea" to show that you actually did think about this; some people don't really think before asking questions :) - Ciliata
Possible duplicate of Git repository unique id - Ciliata
I read the above question and its answers. There is a bit of overlap, but the questions differ enough that the answers there don't really apply. I did get a useful hint there, however: the suggestion to use git notes could be an interesting approach. I would prefer a non-intrusive approach, though. - Dicentra
If you want a non-intrusive solution, my approach of just using the first SHA-1 will work fine (again, as long as the first commit is never changed afterwards, which is very unlikely). - Ciliata

As the related question shows, there is no unique identifier for a git repository. However, you could simply compare the SHA-1 of the first commit on the master branch; that should suffice in 99.999% of all cases (assuming that the first commit is never changed).

If you want to be even more sure, you could also use the SHA-1 of the second commit, again assuming it will never change :). With the SHA-1s of the first two commits, I guess you have about a 1 / 2^320 = 4.7*10^-97 chance of being wrong ...

If you are not sure there is even a master branch, you could assume there is only one parentless root commit and take its SHA-1. You can use this command to get the root commit (or commits):

git rev-list --parents HEAD | grep -E "^[a-f0-9]{40}$"

(copied from this answer)

or (easier to understand, thanks @TomHale):

git rev-list --parents HEAD | tail -1
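
The root-commit comparison can also be sketched with git rev-list --max-parents=0, which prints only parentless commits and so avoids filtering the output. This is a minimal sketch, assuming both repositories are available locally and that history has not been rewritten; root_commits and same_origin are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the hash of every parentless (root) commit
# reachable from any ref of the repository at $1.
root_commits() {
    git -C "$1" rev-list --max-parents=0 --all
}

# Hypothetical helper: succeed if the two repositories share a root commit.
# Returns 0 on a shared root, 1 on none, 2 if either path is not a repository.
same_origin() {
    roots_a=$(root_commits "$1") || return 2
    roots_b=$(root_commits "$2") || return 2
    # A repository without commits has no roots and cannot match anything.
    [ -n "$roots_a" ] && [ -n "$roots_b" ] || return 1
    # -F: fixed strings, -x: whole-line match; each line of roots_a is a pattern.
    printf '%s\n' "$roots_b" | grep -Fxq -e "$roots_a"
}

# Usage (paths are placeholders):
#   same_origin ~/src/repoA ~/src/repoB && echo "histories share a root commit"
```

Using --all rather than HEAD covers repositories with several unrelated roots and does not depend on which branch is checked out.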
Ciliata answered 19/1, 2016 at 10:55 Comment(9)
Well, that's basically an over-simplified version of my tentative idea. It will fail if there is no "master" branch. You could use the 'default' branch (the branch pointed to by HEAD) instead, but that will fail if the cloned repo doesn't know about that branch, which can happen with this workflow: suppose A has branches b1 and b2, and you run git clone A B -b b2; git clone B C. Now git branch -a in C will only list b2 (as remotes/origin/b2). My tentative approach at least gets that case right. But perhaps someone sees an improvement that can be made? - Dicentra
That sounds like a winner. I'll check it out. - Dicentra
git rev-list --parents HEAD | tail -1 is easier to understand, faster, and achieves the same effect. - Laundryman
Indeed, lots easier :) I added that to my answer. I kept the other command, since there might be cases where the parentless commit is not the last in the list. - Ciliata
This answer misses the birthday paradox; you're way more likely to get a collision than the math above suggests, but it's still way better than the 99.999% reliability stated in the answer. - Major
@FilipHaglund thanks for the comment; just for the sake of knowing, could you tell me in what way it is more probable? - Ciliata
@ChrisMaes you're assuming that the SHA-1 of one repo is some constant c, when it actually could be any of 2^160 constants, all equally valid, and you have to compare the other hash to each of those values one at a time. That is 1-(probability of missing this particular c, for each possible c) rather than 1-(probability of missing the single constant c). See en.wikipedia.org/wiki/Birthday_problem - Major
I wanted to add that I just used this method, but I was looking for a way to do this without a git client present (at this stage we don't want to have to find the location of a git executable, make that configurable, or bundle a git client). I found that you can get this value from the first line of "<repo>/.git/logs/refs/heads/master". Of course this doesn't handle the case of no master branch (although all branch histories are in this directory), but it works for us right now and I thought I'd share it for others who might want to do the same. - Upholsterer
There is at least one case in which it is easy to get a conflict between two repositories: when you have used git-filter-branch to split a repository's content (usually done to reduce the size of a repository that has grown beyond an acceptable threshold). - Variscite
