I have a *lot* of source files to add to a git repo, how to make it fast

I am looking here for inspiration on importing bulk files into a git repo faster, but I'm not sure it's what I need.

Basically the situation is: I have >100 million files I want to commit to a git repo. I have broken them down into directories roughly 5 levels deep. Running git add path/2/3 a few levels down takes about 5 minutes, and then I still have to commit and publish. It takes a long time, and it could take months to commit all of these files.

Please don't get caught up wondering why I'm storing them in git, whether they are source files, whether there is a better solution, etc. I specifically want to know how much git can handle and whether it can handle this many files in a more optimal way.

FYI these are all config files or CSV-like data files, some very large, most of them small.

If I try to commit the whole thing, or just a large fragment, it might take an hour to commit them all. But publishing might take several hours; I have tried it already, and often the internet cuts out and boom, you have to start over. So I don't think it's a viable solution.

What I'm wondering is whether there is a way to load all this stuff directly into git in one fell swoop, the way you would load a database from a dump, bypassing all the work git does when making a commit, and have it produce a commit at the end. Then publish somehow the way rsync does, where it's robust and doesn't break when the internet connection drops, so it would be like a normal upload.

Buttaro answered 28/5, 2019 at 5:1 Comment(2)
I don't know of a way to make git add run faster, other than maybe not adding so many files at once. – Pomelo
The article you link to mentions that it could take several hours for 10k commits, so your options might be faster hardware (CPU and disk) or loading the files into a RAMDISK first. Assuming this is *nix, you can try setting the permissions to 777. – Guttle

There are few hard limits on the number of files—technically blob objects—that a Git database can store.1 There are a bunch of softer limits, though.

I have handy two fairly big repositories—FreeBSD and Linux—that weigh in at 5.7 and 6.7 million objects. This is far less than 100 million files: the Linux repository is about 1/15th that size, and even then there aren't that many files, as many of the objects are commits and trees.
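
If you want to reproduce counts like these on a repository of your own, one way (shown only as a rough sketch) is:

git count-objects -v -H    # totals and sizes for loose and packed objects
# breakdown by object type (blobs vs. trees vs. commits vs. tags)
git cat-file --batch-all-objects --batch-check='%(objecttype)' | sort | uniq -c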

Note that there's a difference between putting 100 million files into one commit, and putting 100 million files into 100 million commits, each of which stores one file. The former will require building an index that lists 100 million files, which is several gigabytes of index file and is likely to be slow, but then stores 100 million blobs, plus one tree object per directory, plus one commit. The latter would build a small index (of 1 file), make one commit using one tree holding one blob, then repeat 100 million times: the index would never be large but the repository would store 300 million objects: 100 million commits, each with 1 tree and 1 blob.

It's not immediately obvious where all the time is going. git add <path> requires:

  • compressing the contents of the file and creating a new blob object, or re-using an existing blob object if the content's hash ID is that of an existing object; then
  • updating the index so that staging slot zero of the appropriate file name appears in the right position (the two steps are sketched with plumbing commands just below this list).
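
As a rough illustration only, the same two steps can be spelled out with Git's plumbing commands (the path name here is made up):

# 1. compress the content and store it as a blob object, printing its hash ID
blob=$(git hash-object -w path/to/file.csv)

# 2. record that blob at staging slot zero of the index under the given path
git update-index --add --cacheinfo 100644 "$blob" path/to/file.csv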

The index is sorted, so this update could conceivably be very fast—if the new file goes at the end of the index, a single however-many byte append would suffice—or incredibly slow: an insert at the front will be O(n²) on the number of entries already in the index, as all of them will have to move down. In practice, Git reads the index into memory, performs the operation there, and writes the index back, so it will probably be very slow once the index passes some size threshold (which will vary depending on the OS and underlying storage medium type / speed).
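
To see where that cost lives, note that the index is the single file .git/index, and every git add rewrites it; both of the figures below grow with the number of tracked paths:

git ls-files | wc -l    # number of entries currently in the index
ls -lh .git/index       # on-disk size of the index file being rewritten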

You may also need a great deal of disk space between packing objects. Modern Git will run git gc --auto after each commit, but between early Git and 2.17.0 (when it was fixed), git commit accidentally didn't. Given your situation you probably want to disable automatic git gc anyway, and run it at controlled intervals—or, as in the documentation you linked, use git fast-import to build a pack file without going through the normal Git channels. This will avoid the need for an index entirely (until you run git checkout to extract one of these commits, that is).
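
As a rough sketch of that route (the branch name, identity, path, and file contents below are invented; the stream format is described in the git fast-import documentation, and each data count must equal the exact byte length of the content that follows it):

git config gc.auto 0    # disable automatic gc for the duration of the import

git fast-import <<'EOF'
commit refs/heads/bulk-import
committer Importer <importer@example.com> 1559000000 +0000
data 11
bulk import
M 100644 inline path/2/3/example.csv
data 6
a,b,c
EOF

git gc                  # repack once, at a time you control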


1 The only real hard limit is that there are only 2^160 possible hash IDs. However, you run into a noticeably high hash-collision probability, on the order of 10^-18 (which is also many disk manufacturers' quoted uncorrected bit error rate), by the time you're up to about 1.7 quadrillion objects.
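For the curious, that figure is just the standard birthday approximation: p ≈ n^2 / 2^161, and with n = 1.7×10^15 that gives roughly 2.9×10^30 / 2.9×10^48 ≈ 10^-18.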

Chil answered 28/5, 2019 at 5:38 Comment(0)

git fast-import (used by tools like git filter-repo) is indeed a good option, and, with Git 2.27 (Q2 2020), it is even faster.

The custom hash function used by "git fast-import" has been replaced with the one from hashmap.c, which gave a nice performance boost.

See commit d8410a8 (06 Apr 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 6ae3c79, 28 Apr 2020)

fast-import: replace custom hash with hashmap.c

Signed-off-by: Jeff King

We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list.
As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.

We can scale better by using our hashmap.c implementation, which will resize the bucket count as we grow.
This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its hashmap_entry struct (which we really don't care about here, because we're just reusing the embedded object hash).
But I think the numbers below justify this (and our per-object memory cost is already much higher).

I also looked at using khash, but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes.
It's also awkward to use here, because we want to look up a "struct object_entry" from a "struct object_id", and it doesn't handle mismatched keys as well.
Making a mapping of object_id to object_entry would be more natural, but that would require pulling the embedded oid out of the object_entry or incurring an extra 32 bytes per object.

In a synthetic test creating as many cheap, tiny objects as possible

perl -e '
  my $bits = shift;
  my $nr = 2**$bits;

  for (my $i = 0; $i < $nr; $i++) {
    print "blob\n";
    print "data 4\n";
    print pack("N", $i);
  }
' "$bits" | git fast-import

I got these results:

nr_objects   master       khash        hashmap
2^20         0m4.317s     0m5.109s     0m3.890s
2^21         0m10.204s    0m9.702s     0m7.933s
2^22         0m27.159s    0m17.911s    0m16.751s
2^23         1m19.038s    0m35.080s    0m31.963s
2^24         4m18.766s    1m10.233s    1m6.793s

which points to hashmap as the winner.

We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history).
It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).

Here are the results for linux.git:

Test                        HEAD^                 HEAD
----------------------------------------------------------------------------
9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%

It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s).
This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.


Git 2.37 (Q3 2022) illustrates how to check the size of those object_entry structs.

See commit 14deb58 (28 Jun 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit b59f04f, 13 Jul 2022)

pack-objects.h: remove outdated pahole results

Signed-off-by: Taylor Blau

The size and padding of struct object_entry is an important factor in determining the memory usage of pack-objects.
For this reason, 3b13a5f (pack-objects: reorder members to shrink struct object_entry, 2018-04-14, Git v2.18.0-rc0 -- merge listed in batch #6) added a comment containing some information from pahole indicating the size and padding of that struct.

Unfortunately, this comment hasn't been updated since 9ac3f0e ("pack-objects: fix performance issues on packing large deltas", 2018-07-22, Git v2.19.0-rc1 -- merge), despite the size of this struct changing many times since that commit.

To see just how often the size of object_entry changes, I skimmed the first-parent history with this script:

for sha in $(git rev-list --first-parent --reverse 9ac3f0e..)
do
  echo -n "$sha "
  git checkout -q $sha
  make -s pack-objects.o 2>/dev/null
  pahole -C object_entry pack-objects.o | sed -n \
    -e 's/\/\* size: \([0-9]*\).*/size \1/p' \
    -e 's/\/\*.*padding: \([0-9]*\).*/padding \1/p' | xargs
done | uniq -f1

In between each merge, the size of object_entry changes too often to record every instance here.
But the important merges (along with their corresponding sizes and bit paddings) in chronological order are:

ad635e82d6 (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23) size 80 padding 4
29d9e3e2c4 (Merge branch 'nd/pack-deltify-regression-fix', 2018-08-22) size 80 padding 9
3ebdef2e1b (Merge branch 'jk/pack-delta-reuse-with-bitmap', 2018-09-17) size 80 padding 8
33e4ae9c50 (Merge branch 'bc/sha-256', 2019-01-29) size 96 padding 8

(indicating that the current size of the struct is 96 bytes, with 8 padding bits).

Even though this comment was written in a good spirit, it is updated infrequently enough that it serves to confuse rather than to encourage contributors to update the appropriate values when they modify the definition of object_entry.

For that reason, eliminate the confusion by removing the comment altogether.

Schleswigholstein answered 1/5, 2020 at 20:10 Comment(0)
