Git is really slow for 100,000 objects. Any fixes?

I have a "fresh" git-svn repo (11.13 GB) that has over a 100,000 objects in it.

I have performed

git fsck
git gc

on the repo after the initial checkout.

I then tried to do a

git status

The time it takes to do a git status is anywhere from 2m25.578s to 2m53.901s.

I tested git status by issuing the command

time git status

5 times, and all of the runs fell between the two times listed above.

I am doing this on Mac OS X, locally, not through a VM.

There is no way it should be taking this long.

Any ideas? Help?

Thanks.

Edit

I have a co-worker sitting right next to me with a comparable box. Less RAM and running Debian with a JFS filesystem. His git status runs in 0.3 seconds on the same repo (it is also a git-svn checkout).

Also, I recently changed my file permissions (to 777) on this folder and it brought the time down considerably (why, I have no clue). I can now get it done in anywhere between 3 and 6 seconds. This is manageable, but still a pain.

Kallick answered 22/7, 2010 at 22:10 Comment(4)
how much ram do you have installed? and what kind of disk?Valenzuela
8gb of RAM Hitachi HTS543232L9SA02: Capacity: 320.07 GB (320,072,933,376 bytes)Kallick
How big is the repo (MB, not objects)? You're right, though, that it shouldn't take that long -- I have a repo with > 300K objects and "git status" takes .1 ms on a similar machine.Fuller
is 11.13 GB the size of .git or the whole repo with .git in it?Spiccato

It came down to a couple of items that I can see right now.

  1. git gc --aggressive
  2. Opening up file permissions to 777

There has to be something else going on, but these were the things that clearly made the biggest impact.
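
A minimal sketch of those two items as commands, run from the repo root (this assumes "opening up file permissions" meant a recursive chmod of the working tree, and note the caveats about --aggressive in the comments below):

# aggressive repack of the repository (see the comments below before using this)
git gc --aggressive
# open the working tree up to everyone -- assumption: this is what item 2 refers to
chmod -R 777 .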

Kallick answered 26/7, 2010 at 23:3 Comment(8)
Some insight from Linus: metalinguist.wordpress.com/2007/12/06/…Resect
Before doing git gc --aggressive, read the link in @CharlesL.'s comment to see why it's not recommended and why it will possibly become undocumented. For me, it ended in an Out of memory fatal error and broke my local repo.Yingyingkow
-1, sorry. git gc --aggressive may have fixed your specific problem but it messes up good packs. A gc + an aggressive repack would have probably been better (see @Charles's link above), followed by a re-clone (git clone file:///Users/foo/bar/myrepo/.git newclone).Venetis
@CharlesL.: Is that guidance still correct? Seems like --aggressive was never removed, but now supports --depth and --window, which makes me suspect it's just an alternate way to do the work described as the "better approach" in that blog.Mouthwash
@CharlesL. That link is dead, archived page hereEncephalogram
@forresthopkinsa: And to be really pedantic, Linus' comment is actually here: Re: Git and GCC.Biocatalyst
@Moreaki Good callEncephalogram
Refer to https://mcmap.net/q/20477/-git-gc-aggressive-vs-git-repack for more details.Gothicism

git status has to look at every file in the repository every time. You can tell it to stop looking at trees that you aren't working on with

git update-index --assume-unchanged <trees to skip>

source

From the manpage:

When these flags are specified, the object names recorded for the paths are not updated. Instead, these options set and unset the "assume unchanged" bit for the paths. When the "assume unchanged" bit is on, git stops checking the working tree files for possible modifications, so you need to manually unset the bit to tell git when you change the working tree file. This is sometimes helpful when working with a big project on a filesystem that has very slow lstat(2) system call (e.g. cifs).

This option can be also used as a coarse file-level mechanism to ignore uncommitted changes in tracked files (akin to what .gitignore does for untracked files). Git will fail (gracefully) in case it needs to modify this file in the index e.g. when merging in a commit; thus, in case the assumed-untracked file is changed upstream, you will need to handle the situation manually.

Many operations in git depend on your filesystem to have an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set "assume unchanged" bit to paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it has changed — it makes git to omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping "assume unchanged" bit, either before or after you modify them.

...

In order to set "assume unchanged" bit, use --assume-unchanged option. To unset, use --no-assume-unchanged.

The command looks at core.ignorestat configuration variable. When this is true, paths updated with git update-index paths… and paths updated with other git commands that update both index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that "assume unchanged" bit is not set if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").


Now, clearly, this solution is only going to work if there are parts of the repo that you can conveniently ignore. I work on a project of similar size, and there are definitely large trees that I don't need to check on a regular basis. The semantics of git-status make it a generally O(n) problem (n being the number of files). You need domain-specific optimizations to do better than that.

Note that if you work in a stitching pattern, that is, if you integrate changes from upstream by merge instead of rebase, then this solution becomes less convenient, because a change to an --assume-unchanged object merging in from upstream becomes a merge conflict. You can avoid this problem with a rebasing workflow.
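
If you do want to flip the bit for whole directories, one sketch (echoed by the comment below) is to pipe the file list into update-index; vendor/ here is just a placeholder path:

# mark every tracked file under vendor/ as "assume unchanged"
git ls-files -z vendor/ | git update-index -z --assume-unchanged --stdin
# undo it later
git ls-files -z vendor/ | git update-index -z --no-assume-unchanged --stdin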

Erme answered 25/7, 2010 at 23:45 Comment(2)
It does not appear that you can do this for whole folders. You have to add files individually. This would not work if that is the case.Kallick
@Kallick do git ls-files dir dir2 -z | git update-index -z --assume-unchanged --stdinChordophone

For files you do not version, see also "UNTRACKED FILES AND PERFORMANCE" with git status.
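
If untracked files turn out to be the expensive part, that section describes options along these lines:

git status -uno                           # skip untracked files for this one invocation
git config status.showUntrackedFiles no   # make that the default for this repository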


git status should be quicker in Git 2.13 (Q2 2017), in part because of the following change:

See commit a33fc72 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit cdfe138, 24 Apr 2017)

read-cache: force_verify_index_checksum

Teach git to skip verification of the SHA-1 checksum at the end of the index file in verify_hdr(), which is called from read_index(), unless the "force_verify_index_checksum" global variable is set.

Teach fsck to force this verification.

The checksum verification is for detecting disk corruption, and for small projects, the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command.


Git 2.14 again improves git status performance by better taking into account the "untracked cache", which allows Git to skip reading the untracked directories if their stat data have not changed, using the mtime field of the stat structure.

See the Documentation/technical/index-format.txt for more on untracked cache.
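
A sketch of trying the untracked cache on a repository (it only helps on filesystems whose mtime behaviour passes Git's own check):

git update-index --test-untracked-cache   # verify the filesystem supports it
git config core.untrackedCache true       # enable it for this repository
git update-index --untracked-cache        # write the extension into the index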

See commit edf3b90 (08 May 2017) by David Turner (dturner-tw).
(Merged by Junio C Hamano -- gitster -- in commit fa0624f, 30 May 2017)

When "git checkout", "git merge", etc. manipulates the in-core index, various pieces of information in the index extensions are discarded from the original state, as it is usually not the case that they are kept up-to-date and in-sync with the operation on the main index.

The untracked cache extension is copied across these operations now, which would speed up "git status" (as long as the cache is properly invalidated).


More generally, writing to the cache will also be quicker with Git 2.14.x/2.15.

See commit ce012de, commit b50386c, commit 3921a0b (21 Aug 2017) by Kevin Willford.
(Merged by Junio C Hamano -- gitster -- in commit 030faf2, 27 Aug 2017)

We used to spend more cycles than necessary allocating and freeing pieces of memory while writing each index entry out.
This has been optimized.

[That] would save anywhere between 3-7% when the index had over a million entries with no performance degradation on small repos.


Update Dec. 2017: Git 2.16 (Q1 2018) will propose an additional enhancement, this time for git log, since the code to iterate over loose object files just got optimized.

See commit 163ee5e (04 Dec 2017) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 97e1f85, 13 Dec 2017)

sha1_file: use strbuf_add() instead of strbuf_addf()

Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex values of the string before consuming the path, we can prevent extra computation by using the lower-level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID (object identifiers) abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking, but in the GVFS case (see "Announcing GVFS (Git Virtual File System)") the repos can grow to have millions of loose objects.
Profiling 'git log' performance in Git For Windows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading.
By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:

HEAD~1            HEAD
----------------------------------------
7.70(7.15+0.54)   7.44(7.09+0.29) -3.4%

Update March 2018: Git 2.17 will improve git status some more: see this answer.


Update: Git 2.20 (Q4 2018) adds Index Entry Offset Table (IEOT), which allows for git status to load the index faster.

See commit 77ff112, commit 3255089, commit abb4bb8, commit c780b9c, commit 3b1d9e0, commit 371ed0d (10 Oct 2018) by Ben Peart (benpeart).
See commit 252d079 (26 Sep 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit e27bfaa, 19 Oct 2018)

read-cache: load cache entries on worker threads

This patch helps address the CPU cost of loading the index by utilizing the Index Entry Offset Table (IEOT) to divide loading and conversion of the cache entries across multiple threads in parallel.

I used p0002-read-cache.sh to generate some performance data:

Test w/100,000 files reduced the time by 32.24%
Test w/1,000,000 files reduced the time by -4.77%

Note that in the 1,000,000 files case, multi-threading the cache entry parsing does not yield a performance win. This is because the cost to parse the index extensions in this repo far outweighs the cost of loading the cache entries.

That allows for:

config: add new index.threads config setting

Add support for a new index.threads config setting which will be used to control the threading code in do_read_index().

  • A value of 0 will tell the index code to automatically determine the correct number of threads to use.
  • A value of 1 will make the code single threaded.
  • A value greater than 1 will set the maximum number of threads to use.

For testing purposes, this setting can be overwritten by setting the GIT_TEST_INDEX_THREADS=<n> environment variable to a value greater than 0.
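
A short sketch of that setting in use:

git config index.threads 0            # let Git pick the number of threads
git config index.threads 1            # force single-threaded index reading
GIT_TEST_INDEX_THREADS=4 git status   # temporary override, as described above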


Git 2.21 (Q1 2019) introduces a new improvement: the loose object cache, used to optimize existence look-ups, has been updated.

See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time.
It also avoids eight binary search steps for each cache lookup as a small bonus.

The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

$ git count-objects
26733 objects, 68808 kilobytes

Test                        HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%

With Git 2.26 (Q1 2020), the object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.

There however are some cases where they can work together, and they were taught about them.

See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)

pack-bitmap: implement BLOB_NONE filtering

Signed-off-by: Jeff King

We can easily support BLOB_NONE filters with bitmaps.
Since we know the types of all of the objects, we just need to clear the result bits of any blobs.

Note two subtleties in the implementation (which I also called out in comments):

  • we have to include any blobs that were specifically asked for (and not reached through graph traversal) to match the non-bitmap version
  • we have to handle in-pack and "ext_index" objects separately.
    Arguably prepare_bitmap_walk() could be adding these ext_index objects to the type bitmaps.
    But it doesn't for now, so let's match the rest of the bitmap code here (it probably wouldn't be an efficiency improvement to do so since the cost of extending those bitmaps is about the same as our loop here, but it might make the code a bit simpler).

Here are perf results for the new test on git.git:

Test                                    HEAD^             HEAD
--------------------------------------------------------------------------------
5310.9: rev-list count with blob:none   1.67(1.62+0.05)   0.22(0.21+0.02) -86.8%
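
For context, blob:none is the filter a "blobless" partial clone asks for, so this speedup benefits commands such as (the URL is a placeholder):

git clone --filter=blob:none https://example.com/repo.git
git rev-list --objects --filter=blob:none --all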

To learn more about oid_array, consider Git 2.27 (Q2 2020).

See commit 0740d0a, commit c79eddf, commit 7383b25, commit ed4b804, commit fe299ec, commit eccce52, commit 600bee4 (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a768f86, 22 Apr 2020)

oid_array: use size_t for count and allocation

Signed-off-by: Jeff King

The oid_array object uses an "int" to store the number of items and the allocated size.

It's rather unlikely for somebody to have more than 2^31 objects in a repository (the sha1's alone would be 40GB!), but if they do, we'd overflow our alloc variable.

You can reproduce this case with something like:

git init repo
cd repo

# make a pack with 2^24 objects
perl -e '
  my $nr = 2**24;

  for (my $i = 0; $i < $nr; $i++) {
    print "blob\n";
    print "data 4\n";
    print pack("N", $i);
  }
' | git fast-import

# now make 256 copies of it; most of these objects will be duplicates,
# but oid_array doesn't de-dup until all values are read and it can
# sort the result.
cd .git/objects/pack/
pack=$(echo *.pack)
idx=$(echo *.idx)
for i in $(seq 0 255); do
  # no need to waste disk space
  ln "$pack" "pack-extra-$i.pack"
  ln "$idx" "pack-extra-$i.idx"
done

# and now force an oid_array to store all of it
git cat-file --batch-all-objects --batch-check

which results in:

fatal: size_t overflow: 32 * 18446744071562067968

So the good news is that st_mult() sees the problem (the large number is because our int wraps negative, and then that gets cast to a size_t), doing the job it was meant to: bailing in crazy situations rather than causing an undersized buffer.

But we should avoid hitting this case at all, and instead limit ourselves based on what malloc() is willing to give us.
We can easily do that by switching to size_t.

The cat-file process above made it to ~120GB virtual set size before the integer overflow (our internal hash storage is 32 bytes now in preparation for sha256, so we'd expect ~128GB total needed, plus potentially more to copy from one realloc'd block to another).
After this patch (and about 130GB of RAM+swap), it does eventually read in the whole set. No test for obvious reasons.

Note that this object was defined in sha1-array.c, which has been renamed oid-array.c: a more neutral name, considering Git will eventually transition from SHA-1 to SHA-2.


Another optimization:

With Git 2.31 (Q1 2021), the code around the cache-tree extension in the index has been optimized.

See commit a4b6d20, commit 4bdde33, commit 22ad860, commit 845d15d (07 Jan 2021), and commit 0e5c950, commit 4c3e187, commit fa7ca5d, commit c338898, commit da8be8c (04 Jan 2021) by Derrick Stolee (derrickstolee).
See commit 0b72536 (07 Jan 2021) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit a0a2d75, 05 Feb 2021)

cache-tree: speed up consecutive path comparisons

Signed-off-by: Derrick Stolee

The previous change reduced time spent in strlen() while comparing consecutive paths in verify_cache(), but we can do better.
The conditional checks the existence of a directory separator at the correct location, but only after doing a string comparison.
Swap the order to be logically equivalent but perform fewer string comparisons.

To test the effect on performance, I used a repository with over three million paths in the index.
I then ran the following command on repeat:

git -c index.threads=1 commit --amend --allow-empty --no-edit

Here are the measurements over 10 runs after a 5-run warmup:

Benchmark #1: v2.30.0
  Time (mean ± σ):     854.5 ms ±  18.2 ms
  Range (min … max):   825.0 ms … 892.8 ms

Benchmark #2: Previous change
  Time (mean ± σ):     833.2 ms ±  10.3 ms
  Range (min … max):   815.8 ms … 849.7 ms

Benchmark #3: This change
  Time (mean ± σ):     815.5 ms ±  18.1 ms
  Range (min … max):   795.4 ms … 849.5 ms

This change is 2% faster than the previous change and 5% faster than v2.30.0.

Anarchist answered 27/4, 2017 at 21:12 Comment(1)
Instead why not create a blog page and post link here.Scarecrow

One longer-term solution is to augment git to cache filesystem status internally.

Karsten Blees has done so for msysgit, which dramatically improves performance on Windows. In my experiments, his change has taken the time for "git status" from 25 seconds to 1-2 seconds on my Win7 machine running in a VM.

Karsten's changes: https://github.com/msysgit/git/pull/94

Discussion of the caching approach: https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
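
In current Git for Windows builds this cache is exposed as a configuration knob (only meaningful on Windows):

git config core.fscache true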

Its answered 17/10, 2013 at 15:29 Comment(3)
Just to follow up on this: Karsten's changes have now been added to the official msysgit distribution.Its
Caching will be killer. Will create lots of complexities, scenarios making git slower.Scarecrow
@AltafPatel caching has been implemented internally for quite some time now (years!) in git for windows and works just fine, providing massive speed improvements. I believe it's on by default. Search for "git fscache" for details.Its

In general my Mac is OK with git, but if there are a lot of loose objects then it gets very much slower. It seems HFS+ is not so good with lots of files in a single directory.

git repack -ad

Followed by

git gc --prune=now

Will make a single pack file and remove any loose objects left over. It can take some time to run these.
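
You can check how many loose objects are left before and after with:

git count-objects -v    # "count" is the number of loose objects, "in-pack" the packed ones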

Bonita answered 6/3, 2014 at 20:13 Comment(0)

For what it's worth, I recently found a large discrepancy in git status time between my master and dev branches.

To cut a long story short, I tracked down the problem to a single 280MB file in the project root directory. It was an accidental checkin of a database dump so it was fine to delete it.

Here's the before and after:

⚡ time git status
# On branch master
nothing to commit (working directory clean)
git status  1.35s user 0.25s system 98% cpu 1.615 total

⚡ rm savedev.sql

⚡ time git status
# On branch master
# Changes not staged for commit:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   deleted:    savedev.sql
#
no changes added to commit (use "git add" and/or "git commit -a")
git status  0.07s user 0.08s system 98% cpu 0.157 total

I have 105,000 objects in store, but it appears that large files are more of a menace than many small files.

Jacksonjacksonville answered 2/10, 2011 at 16:40 Comment(1)
Of course, that's because git status checks every file for changes by re-reading the contents and checking them against .git data, which includes the 280MB every time you call git status.Spiccato

You could try passing the --aggressive switch to git gc and see if that helps:

# this will take a while ...
git gc --aggressive

Also, you could use git filter-branch to delete old commits and/or files if you have things which you don't need in your history (e.g., old binary files).
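
If you go that route, a commonly used filter-branch invocation for purging one large file from all history looks roughly like this (the path is a placeholder, and since it rewrites history, treat it as a sketch and back up first):

git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch path/to/huge-file.bin' \
  --prune-empty -- --all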

Oakum answered 22/7, 2010 at 22:12 Comment(2)
Trying git gc --aggressive. You're right, this is going to take a while.Kallick
git filter-branch will not work for me. There is no history that I can lose.Kallick

You also might try git repack

Chatty answered 22/7, 2010 at 22:14 Comment(1)
"Nothing new to pack." is what git repack returned. ThanksKallick

Try running the prune command; it will get rid of loose objects:

git remote prune origin

Fluoroscopy answered 10/3, 2016 at 11:49 Comment(0)

Maybe Spotlight is trying to index the files. Perhaps disable Spotlight for your code directory. Check Activity Monitor and see what processes are running.
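
If the code lives on its own volume, one way to switch indexing off for it is mdutil (an assumption about the setup; otherwise add the folder under Spotlight's Privacy settings):

sudo mdutil -i off /Volumes/Code    # /Volumes/Code is a placeholder volume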

Valenzuela answered 22/7, 2010 at 22:26 Comment(2)
Well, that is a good idea, but my hard drive has no activity when I am not running git status. I will try this, but I don't think it is relevant. Thanks.Kallick
Turned off indexing for that directory. This has made no difference. Thanks.Kallick

I'd create a partition using a different file system. HFS+ has always been sluggish for me compared to doing similar operations on other file systems.

Subtenant answered 24/7, 2010 at 18:43 Comment(2)
I am transferring it to an ext2 partition. I will let you know if it fixes it.Kallick
This does not seem to make that big of a difference. A 10-second gain, but it still shoots up to 45 seconds or so.Kallick
