`git fast-import` (used by tools like `git filter-repo`) is indeed a good option, and, with Git 2.27 (Q2 2020), it is even faster: the custom hash function used by `git fast-import` has been replaced with the one from `hashmap.c`, which gave a nice performance boost.
See commit d8410a8 (06 Apr 2020) by Jeff King (`peff`).
(Merged by Junio C Hamano -- `gitster` -- in commit 6ae3c79, 28 Apr 2020)
`fast-import`: replace custom hash with `hashmap.c`
Signed-off-by: Jeff King
We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list.
As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.
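To make that scaling problem concrete, here is a minimal, self-contained C sketch of such a fixed-bucket chained table (illustrative only, not Git's actual code): once the entry count far exceeds the 2^16 buckets, every chain grows linearly, so each lookup is O(n) and inserting n objects costs O(n^2) in total.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS (1 << 16)          /* fixed bucket count: never resized */

struct entry {
    unsigned char key[20];          /* e.g., an object id */
    struct entry *next;             /* collision chain */
};

static struct entry *buckets[NBUCKETS];

/* Derive a bucket index from the first bytes of the key. */
static unsigned int bucket_of(const unsigned char *key)
{
    unsigned int h;
    memcpy(&h, key, sizeof(h));
    return h % NBUCKETS;
}

/* O(chain length): with n >> NBUCKETS, chains average n / 2^16, i.e. O(n). */
static struct entry *lookup(const unsigned char *key)
{
    struct entry *e;
    for (e = buckets[bucket_of(key)]; e; e = e->next)
        if (!memcmp(e->key, key, sizeof(e->key)))
            return e;
    return NULL;
}

/* Insert-if-absent; n such calls degrade to O(n^2) overall. */
static struct entry *insert(const unsigned char *key)
{
    struct entry *e = lookup(key);
    if (e)
        return e;
    e = calloc(1, sizeof(*e));
    memcpy(e->key, key, sizeof(e->key));
    e->next = buckets[bucket_of(key)];
    buckets[bucket_of(key)] = e;
    return e;
}
```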
We can scale better by using our `hashmap.c` implementation, which will resize the bucket count as we grow.
This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its `hashmap_entry` struct (which we really don't care about here, because we're just reusing the embedded object hash).
But I think the numbers below justify this (and our per-object memory cost is already much higher).
I also looked at using khash, but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes.
It's also awkward to use here, because we want to look up a `struct object_entry` from a `struct object_id`, and it doesn't handle mismatched keys as well.
Making a mapping of `object_id` to `object_entry` would be more natural, but that would require pulling the embedded `oid` out of the `object_entry` or incurring an extra 32 bytes per object.
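For context, `hashmap.c` handles exactly this "mismatched key" case: its comparison callback receives a `keydata` pointer, so an entry can be looked up by a bare `struct object_id` without building a whole `object_entry` as the key. A condensed sketch of that pattern (simplified from what the patch does in `fast-import.c`; the struct layout here is reduced to its essentials):

```c
#include "hashmap.h"   /* Git's generic hash table */
#include "hash.h"      /* struct object_id, oidcmp(), oidhash() */

struct object_entry {
    struct hashmap_entry ent;   /* embedded; the 8 extra bytes are its cached hash */
    struct object_id oid;       /* simplified: fast-import keeps this inside idx */
    /* ... other per-object fields ... */
};

/*
 * The callback's keydata argument, when non-NULL, is the bare
 * object_id we are searching for; otherwise compare two full entries.
 */
static int object_entry_hashcmp(const void *map_data,
                                const struct hashmap_entry *eptr,
                                const struct hashmap_entry *entry_or_key,
                                const void *keydata)
{
    const struct object_id *oid = keydata;
    const struct object_entry *e1, *e2;

    e1 = container_of(eptr, const struct object_entry, ent);
    if (oid)
        return oidcmp(&e1->oid, oid);

    e2 = container_of(entry_or_key, const struct object_entry, ent);
    return oidcmp(&e1->oid, &e2->oid);
}

static struct hashmap object_table;

static void init_table(void)
{
    /* bucket count grows automatically as entries are added */
    hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
}

static struct object_entry *find_object(const struct object_id *oid)
{
    struct hashmap_entry lookup_entry, *e;

    hashmap_entry_init(&lookup_entry, oidhash(oid));
    e = hashmap_get(&object_table, &lookup_entry, oid);
    return e ? container_of(e, struct object_entry, ent) : NULL;
}

static void add_object(struct object_entry *oe)
{
    hashmap_entry_init(&oe->ent, oidhash(&oe->oid));
    hashmap_add(&object_table, &oe->ent);
}
```

The embedded `hashmap_entry` is what costs the extra 8 bytes mentioned above, which is still far cheaper than duplicating the 32-byte oid in a separate key mapping.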
In a synthetic test creating as many cheap, tiny objects as possible:

```
perl -e '
    my $bits = shift;
    my $nr = 2**$bits;

    for (my $i = 0; $i < $nr; $i++) {
        print "blob\n";
        print "data 4\n";
        print pack("N", $i);
    }
' "$bits" | git fast-import
```
I got these results:
```
nr_objects   master      khash       hashmap
2^20         0m4.317s    0m5.109s    0m3.890s
2^21         0m10.204s   0m9.702s    0m7.933s
2^22         0m27.159s   0m17.911s   0m16.751s
2^23         1m19.038s   0m35.080s   0m31.963s
2^24         4m18.766s   1m10.233s   1m6.793s
```
which points to hashmap as the winner.
We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history).
It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).
Here are the results for linux.git:
```
Test                        HEAD^                 HEAD
---------------------------------------------------------------------------
9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75)    +0.3%
9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%
```
It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s).
This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.
Git 2.37 (Q3 2022) illustrates how to check the size of those `object_entry` structs.
See commit 14deb58 (28 Jun 2022) by Taylor Blau (`ttaylorr`).
(Merged by Junio C Hamano -- `gitster` -- in commit b59f04f, 13 Jul 2022)
`pack-objects.h`: remove outdated pahole results
Signed-off-by: Taylor Blau
The size and padding of `struct object_entry` is an important factor in determining the memory usage of `pack-objects`.
For this reason, 3b13a5f ("pack-objects: reorder members to shrink struct object_entry", 2018-04-14, Git v2.18.0-rc0 -- merge listed in batch #6) added a comment containing some information from `pahole` indicating the size and padding of that struct.
Unfortunately, this comment hasn't been updated since 9ac3f0e ("pack-objects: fix performance issues on packing large deltas", 2018-07-22, Git v2.19.0-rc1 -- merge), despite the size of this struct changing many times since that commit.
To see just how often the size of `object_entry` changes, I skimmed the first-parent history with this script:
```sh
for sha in $(git rev-list --first-parent --reverse 9ac3f0e..)
do
    echo -n "$sha "
    git checkout -q $sha
    make -s pack-objects.o 2>/dev/null
    pahole -C object_entry pack-objects.o | sed -n \
        -e 's/\/\* size: \([0-9]*\).*/size \1/p' \
        -e 's/\/\*.*padding: \([0-9]*\).*/padding \1/p' | xargs
done | uniq -f1
```
In between each merge, the size of `object_entry` changes too often to record every instance here.
But the important merges (along with their corresponding sizes and bit paddings) in chronological order are:
```
ad635e82d6 (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23)     size 80 padding 4
29d9e3e2c4 (Merge branch 'nd/pack-deltify-regression-fix', 2018-08-22)  size 80 padding 9
3ebdef2e1b (Merge branch 'jk/pack-delta-reuse-with-bitmap', 2018-09-17) size 80 padding 8
33e4ae9c50 (Merge branch 'bc/sha-256', 2019-01-29)                      size 96 padding 8
```
(indicating that the current size of the struct is 96 bytes, with 8 padding bits).
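As an aside on why member order matters at all (the point of the 3b13a5f reordering that introduced the comment), here is a small, hypothetical C example of the kind of padding `pahole` reports; the member names are made up for illustration:

```c
#include <stdint.h>

/*
 * On a typical LP64 platform, pointers need 8-byte alignment, so the
 * compiler inserts padding after each 4-byte member that precedes one.
 */
struct before {
    void     *a;   /* offset 0,  size 8 */
    uint32_t  b;   /* offset 8,  size 4; 4 bytes of padding follow */
    void     *c;   /* offset 16, size 8 */
    uint32_t  d;   /* offset 24, size 4; 4 bytes of tail padding */
};                 /* sizeof(struct before) == 32 */

/* Same members, grouped by alignment: no internal padding at all. */
struct after {
    void     *a;   /* offset 0,  size 8 */
    void     *c;   /* offset 8,  size 8 */
    uint32_t  b;   /* offset 16, size 4 */
    uint32_t  d;   /* offset 20, size 4 */
};                 /* sizeof(struct after) == 24 */
```

A reorder like this is why grouping members saves real memory when `pack-objects` allocates one `object_entry` per object in a repository with millions of objects.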
Even though this comment was written in a good spirit, it is updated infrequently enough that it serves to confuse rather than to encourage contributors to update the appropriate values when they modify the definition of `object_entry`.
For that reason, eliminate the confusion by removing the comment altogether.