How to list ALL git objects in the database?
Asked Answered
S

8

66

Is there a better way of getting a raw list of SHA1s for ALL objects in a repository than:
ls .git/objects/??/\*
and
cat .git/objects/pack/*.idx | git show-index

I know about git rev-list --all but that only lists commit objects that are referenced by .git/refs, and I'm looking for everything, including unreferenced objects that are created by git-hash-object, git-mktree etc.

Sundsvall answered 8/9, 2011 at 13:19 Comment(4)
I was just doing this yesterday on a test repo - used ls -alR .git/objects, but I agree that's not ideal... +1Bilharziasis
On the other hand, if you're looking for unreferenced objects in particular, git-fsck [--unreachable] may be of interest...Bilharziasis
...and ls will only give you the loose objects, not the packed ones, which could also be unreferenced -- hence the git show-index in my questionSundsvall
With Git 2.19 (Q3 2018), git cat-file --batch-check --batch-all-objects --unordered is fairly fast. See my answer below.Driftwood
A
10

Edit: Aristotle posted an even better answer, which should be marked as correct.

Edit: the script contained a syntax error, missing backslash at the end of the grep -v line

Mark's answer worked for me, after a few modifications:

  • Used --git-dir instead of --show-cdup to support bare repos
  • Avoided error when there are no packs
  • Used perl because OS X Mountain Lion's BSD-style sed doesn't support -r

#!/bin/sh

set -e

cd "$(git rev-parse --git-dir)"

# Find all the objects that are in packs:

find objects/pack -name 'pack-*.idx' | while read p ; do
    git show-index < $p | cut -f 2 -d ' '
done

# And now find all loose objects:

find objects/ \
    | egrep '[0-9a-f]{38}' \
    | grep -v /pack/ \
    | perl -pe 's:^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}):\1\2:' \
;
Anciently answered 14/8, 2012 at 16:39 Comment(3)
Thanks. I’ve improved on this in my answer.Zoom
FYI, this script also erroneously includes entries for the packfiles themselves in the latter half, e.g. objects/pack/pack-23fd2d5638f6cd3bf2a433de4ea3cc26ebe7bdfa.idxDiphase
@EthanT Thanks. I added grep -v which fixes it for me. Looks like Aristotle's answer, which I recommend anyway, doesn't have that error.Anciently
B
48

Try

 git rev-list --objects --all

Edit Josh made a good point:

 git rev-list --objects -g --no-walk --all

list objects reachable from the ref-logs.

To see all objects in unreachable commits as well:

 git rev-list --objects --no-walk \
      $(git fsck --unreachable |
        grep '^unreachable commit' |
        cut -d' ' -f3)

Putting it all together, to really get all objects in the output format of rev-list --objects, you need something like

{
    git rev-list --objects --all
    git rev-list --objects -g --no-walk --all
    git rev-list --objects --no-walk \
        $(git fsck --unreachable |
          grep '^unreachable commit' |
          cut -d' ' -f3)
} | sort | uniq

To sort the output in slightly more useful way (by path for tree/blobs, commits first) use an additional | sort -k2 which will group all different blobs (revisions) for identical paths.

Boutwell answered 8/9, 2011 at 14:51 Comment(12)
That only lists the objects which are reachable from the refs/. The reflogs and the unreachable objects are not included.Durant
@Josh, others: extended the answer to be somehwat less shallow. I hope it didn't just get too intimidating. I'm pretty sure the APIs contain superior methods to 'get' at the object database more succinctlyBoutwell
@sehe: I have included a (somewhat modified) version of your script in my git scripts at github.com/johnbartholomew/gitvoodoo. I would like to release these scripts under GPLv3 -- could you tell me if this is acceptable to you? (the origin of the code is noted in the README and in a comment near the top of the script). Thanks for the useful extension!While
@JohnBartholomew: that's ok (the SO license is CC-wiki+SA). I'd seriously look at Marks answer too if you are going to distribute general purpose git scripts :)Boutwell
@sehe: Thanks. I originally used Mark's answer, but yours has the advantage that walking with rev-list allows extraction of things like a tree path for each blob (there isn't a unique one of course, but just having a valid one is informative).While
@sehe: FYI, GNU claim CC-SA isn't GPL compatible, so permission to include code posted here in a GPLv3 work isn't automatic.Combings
@Combings Oh. Can you tell I'm not a lawyer :)Boutwell
This shows which blob path corresponds to each SHA, which is awesome. Is there a nice way to also help easily identify commits and trees? E.g. tree by path, commit by first line of commit message.Sneak
Hmm...this Answer is detailed and precise, but looking at the next Answer, why wouldn't I just use git cat-file --batch-check --batch-all-objects instead of this longer script? The author of that answer claims all objects are listed. Could you comment on the differences, if any?Emiliaemiliaromagna
@Emiliaemiliaromagna I suppose it's just that seven years ago those options did not exist. If even \@Driftwood is okay with it (see his answer), I'm pretty sure everything is fine!Boutwell
@Boutwell Ah I missed the dates on the Answers - thanks!Emiliaemiliaromagna
git rev-list --objects --missing=allow-any --all if you want to prevent sparse repos from reaching out to the network to pull in missing tree objects. Also consider --missing=print instead of --missing=allow-anyCursory
U
46

I don’t know since when this option exists but you can

git cat-file --batch-check --batch-all-objects

This gives you, according to the man page,

all objects in the repository and any alternate object stores (not just reachable objects)

(emphasis mine).

Sample output:

$ git cat-file --batch-check --batch-all-objects
64a77169fe44d06b082cbe52478b3539cb333d45 tree 34
6692c9c6e231b1dfd5594dd59b32001b70060f19 commit 237
740481b1d3ce7de99ed26f7db6687f83ee221d67 blob 50
0e5814c4da88c647652df8b1bd91578f7538c65f tag 200

As can be seen above, this yields the object type and it’s size together with each hash but you can easily remove this information, e.g. with

git cat-file --batch-check --batch-all-objects | cut -d' ' -f1

or by giving a custom format to --batch-check.

Edit: If you don't care about the order you can (since Git 2.19) add the flag --unordered to speed things up. See VonC's answer for more details.

Unwarranted answered 22/12, 2017 at 21:8 Comment(2)
I know that thank-you comments are frowned upon on StackOverflow, but you have saved me one week worth of work with your answer. My installation was FUBAR and reflog and ref-list didn't work.Sagesagebrush
This answer is superior to others in that the output also shows the type and size of an object. (Updated the answer with a sample output to make this clear as I have been looking for this command for a while now.)Janejanean
D
19

The git cat-file --batch-check --batch-all-objects command, suggested in Erki Der Loony's answer, can be made faster with the new Git 2.19 (Q3 2018) option --unordered.

The API to iterate over all objects learned to optionally list objects in the order they appear in packfiles, which helps locality of access if the caller accesses these objects while as objects are enumerated.

See commit 0889aae, commit 79ed0a5, commit 54d2f0d, commit ced9fff (14 Aug 2018), and commit 0750bb5, commit b1adb38, commit aa2f5ef, commit 736eb88, commit 8b36155, commit a7ff6f5, commit 202e7f1 (10 Aug 2018) by Jeff King (peff). (Merged by Junio C Hamano -- gitster -- in commit 0c54cda, 20 Aug 2018)

cat-file: support "unordered" output for --batch-all-objects

If you're going to access the contents of every object in a packfile, it's generally much more efficient to do so in pack order, rather than in hash order. That increases the locality of access within the packfile, which in turn is friendlier to the delta base cache, since the packfile puts related deltas next to each other. By contrast, hash order is effectively random, since the sha1 has no discernible relationship to the content.

This patch introduces an "--unordered" option to cat-file which iterates over packs in pack-order under the hood. You can see the results when dumping all of the file content:

$ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
6883195596

real 0m44.491s
user 0m42.902s
sys  0m5.230s

$ time ./git cat-file --unordered \
                      --batch-all-objects --buffer --batch | wc -c
  6883195596

real 0m6.075s
user 0m4.774s
sys  0m3.548s

Same output, different order, way faster. The same speed-up applies even if you end up accessing the object content in a different process, like:

git cat-file --batch-all-objects --buffer --batch-check |
grep blob |
git cat-file --batch='%(objectname) %(rest)' |
wc -c

Adding "--unordered" to the first command drops the runtime in git.git from 24s to 3.5s.

Side note: there are actually further speedups available for doing it all in-process now. Since we are outputting the object content during the actual pack iteration, we know where to find the object and could skip the extra lookup done by oid_object_info(). This patch stops short of that optimization since the underlying API isn't ready for us to make those sorts of direct requests.

So if --unordered is so much better, why not make it the default? Two reasons:

  1. We've promised in the documentation that --batch-all-objects outputs in hash order. Since cat-file is plumbing, people may be relying on that default, and we can't change it.

  2. It's actually slower for some cases. We have to compute the pack revindex to walk in pack order. And our de-duplication step uses an oidset, rather than a sort-and-dedup, which can end up being more expensive.

If we're just accessing the type and size of each object, for example, like:

git cat-file --batch-all-objects --buffer --batch-check

my best-of-five warm cache timings go from 900ms to 1100ms using --unordered. Though it's possible in a cold-cache or under memory pressure that we could do better, since we'd have better locality within the packfile.

And one final question: why is it "--unordered" and not "--pack-order"? The answer is again two-fold:

  1. "pack order" isn't a well-defined thing across the whole set of objects. We're hitting loose objects, as well as objects in multiple packs, and the only ordering we're promising is within a single pack. The rest is apparently random.

  2. The point here is optimization. So we don't want to promise any particular ordering, but only to say that we will choose an ordering which is likely to be efficient for accessing the object content. That leaves the door open for further changes in the future without having to add another compatibility option


It is even faster in Git 2.20 (Q4 2018) with:

See commit 8c84ae6, commit 8b2f8cb, commit 9249ca2, commit 22a1646, commit bf73282 (04 Oct 2018) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 82d0a8c, 19 Oct 2018)

oidset: use khash

Reimplement oidset using khash.h in order to reduce its memory footprint and make it faster.

Performance of a command that mainly checks for duplicate objects using an oidset, with master and Clang 6.0.1:

$ cmd="./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'"

$ /usr/bin/time $cmd >/dev/null
0.22user 0.03system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 48484maxresident)k
0inputs+0outputs (0major+11204minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

Time (mean ± σ):     250.0 ms ±   6.0 ms    [User: 225.9 ms, System: 23.6 ms]

Range (min … max):   242.0 ms … 261.1 ms

And with this patch:

$ /usr/bin/time $cmd >/dev/null
0.14user 0.00system 0:00.15elapsed 100%CPU (0avgtext+0avgdata 41396maxresident)k
0inputs+0outputs (0major+8318minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

Time (mean ± σ):     151.9 ms ±   4.9 ms    [User: 130.5 ms, System: 21.2 ms]

Range (min … max):   148.2 ms … 170.4 ms

Git 2.21 (Q1 2019) optimizes further the codepath to write out commit-graph, by following the usual pattern of visiting objects in in-pack order.

See commit d7574c9 (19 Jan 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit 04d67b6, 05 Feb 2019)

Slightly optimize the "commit-graph write" step by using FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack().
Derrick Stolee did his own tests on Windows showing a 2% improvement with a high degree of accuracy.


Git 2.23 (Q3 2019) improves "git rev-list --objects" which learned with "--no-object-names" option to squelch the path to the object that is used as a grouping hint for pack-objects.

See commit 42357b4 (19 Jun 2019) by Emily Shaffer (nasamuffin).
(Merged by Junio C Hamano -- gitster -- in commit f4f7e75, 09 Jul 2019)

rev-list: teach --no-object-names to enable piping

Allow easier parsing by cat-file by giving rev-list an option to print only the OID of a non-commit object without any additional information.
This is a short-term shim; later on, rev-list should be taught how to print the types of objects it finds in a format similar to cat-file's.

Before this commit, the output from rev-list needed to be massaged before being piped to cat-file, like so:

git rev-list --objects HEAD | cut -f 1 -d ' ' |
    git cat-file --batch-check

This was especially unexpected when dealing with root trees, as an invisible whitespace exists at the end of the OID:

git rev-list --objects --filter=tree:1 --max-count=1 HEAD |
    xargs -I% echo "AA%AA"

Now, it can be piped directly, as in the added test case:

git rev-list --objects --no-object-names HEAD | git cat-file --batch-check

So that is the difference between:

vonc@vonvb:~/gits/src/git$ git rev-list --objects HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44 
55d368920b2bba16689cb6d4aef2a09e8cfac8ef Documentation
9903384d43ab88f5a124bc667f8d6d3a8bce7dff Documentation/RelNotes
a63204ffe8a040479654c3e44db6c170feca2a58 Documentation/RelNotes/2.23.0.txt

And, with --no-object-name:

vonc@vonvb:~/gits/src/git$ git rev-list --objects --no-object-names HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44
55d368920b2bba16689cb6d4aef2a09e8cfac8ef
9903384d43ab88f5a124bc667f8d6d3a8bce7dff
a63204ffe8a040479654c3e44db6c170feca2a58

With Git 2.31 (Q1 2021), abstract accesses to in-core revindex that allows enumerating objects stored in a packfile in the order they appear in the pack, in preparation for introducing an on-disk precomputed revindex, which should speed up those operations.

See commit e5dcd78, commit d5bc7c6, commit 8389855, commit 1c3855f, commit 2891b43, commit b130aef, commit 0a7e364, commit fc150ca, commit 3a3f54d, commit 45bef5c, commit 78232bf, commit 011f3fd, commit a78a903, commit cf98f2e, commit 5766508, commit eb3fd99, commit 6a5c10c, commit 66cbd3e, commit 952fc68, commit f33fb6e (13 Jan 2021) by Taylor Blau (ttaylorr).
See commit 779412b (14 Jan 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit bcaaf97, 25 Jan 2021)

pack-revindex: introduce a new API

Signed-off-by: Taylor Blau

In the next several patches, we will prepare for loading a reverse index either in memory (mapping the inverse of the .idx's contents in-core), or directly from a yet-to-be-introduced on-disk format.
To prepare for that, we'll introduce an API that avoids the caller explicitly indexing the revindex pointer in the packed_git structure.

There are four ways to interact with the reverse index.
Accordingly, four functions will be exported from 'pack-revindex.h' by the time that the existing API is removed.
A caller may:

  1. Load the pack's reverse index.
    This involves opening up the index, generating an array, and then sorting it.
    Since opening the index can fail, this function ('load_pack_revindex()') returns an int.
    Accordingly, it takes only a single argument: the 'struct packed_git``' the caller wants to build a reverse index for.
    This function is well-suited for both the current and new API.
    Callers will have to continue to open the reverse index explicitly, but this function will eventually learn how to detect and load a reverse index from the on-disk format, if one exists.
    Otherwise, it will fallback to generating one in memory from scratch.

  2. Convert a pack position into an offset.
    This operation is now called pack_pos_to_offset().
    It takes a pack and a position, and returns the corresponding off_t.
    Any error simply calls BUG(), since the callers are not well-suited to handle a failure and keep going.

  3. Convert a pack position into an index position.
    Same as above; this takes a pack and a position, and returns a uint32_t.
    This operation is known as pack_pos_to_index().
    The same thinking about error conditions applies here as well.

  4. Find the pack position for a given offset.
    This operation is now known as offset_to_pack_pos().
    It takes a pack, an offset, and a pointer to a uint32_t where the position is written, if an object exists at that offset.
    Otherwise, -1 is returned to indicate failure.

Unlike some of the callers that used to access '->offset' and '->nr' directly, the error checking around this call is somewhat more robust.
This is important since callers should always pass an offset which points at the boundary of two objects.
The API, unlike direct access, enforces that that is the case.

This will become important in a subsequent patch where a caller which does not but could check the return value treats the signed -1 from find_revindex_position() as an index into the 'revindex' array.

Two design warts are carried over into the new API:

  • Asking for the index position of an out-of-bounds object will result in a BUG() (since no such object exists), but asking for the offset of the non-existent object at the end of the pack returns the total size of the pack.
    This makes it convenient for callers who always want to take the difference of two adjacent object's offsets (to compute the on-disk size) but don't want to worry about boundaries at the end of the pack.
  • offset_to_pack_pos() lazily loads the reverse index, but pack_pos_to_index() doesn't (callers of the former are well-suited to handle errors, but callers of the latter are not).

With Git 2.32 (Q2 2021), "git (branch|tag) --format=..." has been micro-optimized.

See commit 844c3f0 (20 Apr 2021), and commit 22f69a8 (19 Apr 2021) by ZheNing Hu (adlternative).
(Merged by Junio C Hamano -- gitster -- in commit c108c8c, 07 May 2021)

ref-filter: reuse output buffer

Helped-by: Junio C Hamano
Helped-by: Jeff King
Helped-by: René Scharfe
Signed-off-by: ZheNing Hu

When we use git for-each-ref(man), every ref will allocate its own output strbuf and error strbuf.
But we can reuse the final strbuf for each step ref's output.
The error buffer will also be reused, despite the fact that the git will exit when format_ref_array_item() return a non-zero value and output the contents of the error buffer.

The performance for git for-each-ref on the Git repository itself with performance testing tool hyperfine changes from 23.7 ms ± 0.9 ms to 22.2 ms ± 1.0 ms.
Optimization is relatively minor.

At the same time, we apply this optimization to git tag -l(man) and git branch -l(man).

This approach is similar to the one used by 79ed0a5 ("cat-file: use a single strbuf for all output", 2018-08-14, Git v2.19.0-rc0 -- merge) to speed up the cat-file builtin.

Driftwood answered 21/8, 2018 at 21:1 Comment(1)
Z
16

This is a more correct, simpler, and faster rendition of the script from the answers by Mark and by willkill.

  • It uses rev-parse --git-path to find the objects directory even in a more complex Git repository setup (e.g. in a multi-worktree situation or whatnot).

  • It avoids all unnecessary use of find, grep, perl, sed.

  • If works gracefully even if you have no loose objects or no packs (or neither… if you’re inclined to run this on a fresh repository).

  • It does, however, require a Bash from this millennium 😊 (2.02 or newer, specifically, for the extglob bit).

Share and enjoy.

#!/bin/bash
set -e
shopt -s nullglob extglob

cd "`git rev-parse --git-path objects`"

# packed objects
for p in pack/pack-*([0-9a-f]).idx ; do
    git show-index < $p | cut -f 2 -d ' '
done

# loose objects
for o in [0-9a-f][0-9a-f]/*([0-9a-f]) ; do
    echo ${o/\/}
done
Zoom answered 28/6, 2016 at 18:41 Comment(3)
Nice. I like the "bash for this millenium" caveat. +1Driftwood
If the dependency on extglob is a problem, it probably isn't needed: for i in pack/*.idx; do git show-index < $i | cut -f 2 -d' '; done; for i in ??/*; do echo ${i/\/}; doneMuldoon
@Waxrat: That leaves the extglob pattern for loose objects. Also, the $p in my code is unquoted because the use of extglob syntax means there is no way for the variable to contain a filename that will require quoting. If you switch to a glob that can pick up IFS characters, then you also ought to add quotes around uses of the values. — But where would extglob be a problem?Zoom
A
10

Edit: Aristotle posted an even better answer, which should be marked as correct.

Edit: the script contained a syntax error, missing backslash at the end of the grep -v line

Mark's answer worked for me, after a few modifications:

  • Used --git-dir instead of --show-cdup to support bare repos
  • Avoided error when there are no packs
  • Used perl because OS X Mountain Lion's BSD-style sed doesn't support -r

#!/bin/sh

set -e

cd "$(git rev-parse --git-dir)"

# Find all the objects that are in packs:

find objects/pack -name 'pack-*.idx' | while read p ; do
    git show-index < $p | cut -f 2 -d ' '
done

# And now find all loose objects:

find objects/ \
    | egrep '[0-9a-f]{38}' \
    | grep -v /pack/ \
    | perl -pe 's:^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}):\1\2:' \
;
Anciently answered 14/8, 2012 at 16:39 Comment(3)
Thanks. I’ve improved on this in my answer.Zoom
FYI, this script also erroneously includes entries for the packfiles themselves in the latter half, e.g. objects/pack/pack-23fd2d5638f6cd3bf2a433de4ea3cc26ebe7bdfa.idxDiphase
@EthanT Thanks. I added grep -v which fixes it for me. Looks like Aristotle's answer, which I recommend anyway, doesn't have that error.Anciently
P
7

I don't know of an obviously better way than just looking at all the loose object files and the indices of all pack files. The format of the git repository is very stable, and with this method you don't have to rely on having exactly the right options to git fsck, which is classed as porcelain. I think this method is faster, as well. The following script shows all the objects in a repository:

#!/bin/sh

set -e

cd "$(git rev-parse --show-cdup)"

# Find all the objects that are in packs:

for p in .git/objects/pack/pack-*.idx
do
    git show-index < $p | cut -f 2 -d ' '
done

# And now find all loose objects:

find .git/objects/ | egrep '[0-9a-f]{38}' | \
  sed -r 's,^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}),\1\2,'

(My original version of this script was based on this useful script to find the largest objects in your pack files, but I switched to using git show-index, as suggested in your question.)

I've made this script into a GitHub gist.

Palomo answered 12/9, 2011 at 16:0 Comment(1)
Thanks. I’ve improved on this in my answer.Zoom
S
4

Another useful option is to use git verify-pack -v <packfile>

verify-pack -v lists all objects in the database along with their object type.

Saintpierre answered 13/9, 2013 at 18:37 Comment(1)
In some weird situation, git verify-pack -v .git/objects/pack/pack-….idx was the only command that could could confirm that a certain large file was still referenced somewhere, other than git show Some-SHA1. Turns out I had not passed the right parameters to git reflog expire... (and yet, the reflog appeared completely empty!)Fates
C
1

Combine Erki der Loony and sehe solution we can get almost all blob object file name (except for files excluded in git):

git cat-file --batch-check --batch-all-objects | grep blob | cut -d" " -f1 | xargs -n1 git rev-list --objects -g --no-walk --all > .recovered/allblobobject.txt
cat .recovered/allblobobject.txt | sort | uniq > .recovered/allblobuniqobject.txt

This blob to file name map can be used for recovering disappeared files retrieved via:

git fsck --full --no-reflogs --unreachable --lost-found | grep blob | cut -d" " -f3 > .recovered/bloblist.txt
for /F "tokens=*" %A in (.recovered/bloblist.txt) do (git cat-file -p %A > .recovered/%A)
Conjunctive answered 24/1, 2021 at 4:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.