Is there a way to limit the amount of memory that "git gc" uses?

I'm hosting a git repo on a shared host. My repo necessarily has a couple of very large files in it, and every time I try to run "git gc" on the repo now, my process gets killed by the shared hosting provider for using too much memory. Is there a way to limit the amount of memory that git gc can consume? My hope would be that it can trade memory usage for speed and just take a little longer to do its work.

Forehand answered 22/6, 2010 at 17:58 Comment(2)
Related: https://mcmap.net/q/257603/-git-on-windows-quot-out-of-memory-malloc-failed-quot – Kerk
Yes, I had a similar problem on Dreamhost (which this question is tagged with). Git wasn't killed that often, but darcs (another VCS) always got killed, so it's unusable on Dreamhost. – Excitor

Yes, have a look at the git config help page, specifically the pack.* options: pack.depth, pack.window, pack.windowMemory and pack.deltaCacheSize.

It's not an exact limit, as git needs to map each object into memory, so one very large object can cause a lot of memory usage regardless of the window and delta cache settings.

You may have better luck packing locally and transferring the pack files to the remote side "manually", adding a .keep file so that the remote git never tries to completely repack everything.
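A rough sketch of that workflow, assuming SSH access to the host and a bare repository at ~/repos/myrepo.git (the paths and the pack file name are placeholders):

# on a machine with enough memory: repack everything into a single pack
git repack -a -d -f
# copy the resulting pack and its index into the remote object store
scp .git/objects/pack/pack-<hash>.pack .git/objects/pack/pack-<hash>.idx user@host:~/repos/myrepo.git/objects/pack/
# mark the pack as kept so the remote side never tries to repack it
ssh user@host 'touch ~/repos/myrepo.git/objects/pack/pack-<hash>.keep'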

Rowel answered 22/6, 2010 at 18:5 Comment(0)

I used the instructions from this link. It's the same idea as Charles Bailey suggested.

A copy of the commands is here:

git config --global pack.windowMemory "100m"
git config --global pack.packSizeLimit "100m"
git config --global pack.threads "1"

This worked for me on HostGator with a shared hosting account.
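If you'd rather not change your global config, the same limits can be set per repository; a minimal sketch, run inside the repository on the shared host (the path is a placeholder):

cd ~/path/to/repo.git
git config pack.windowMemory "100m"
git config pack.packSizeLimit "100m"
git config pack.threads "1"
git gc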

Dryfoos answered 6/1, 2012 at 16:59 Comment(5)
Thanks! This works for me, but I think there's a typo in the second line - there's no SizeLimit option; it should read: git config --global pack.packSizeLimit "100m" – Kuban
This worked perfectly. If it doesn't work at first, try a lower limit on windowMemory and packSizeLimit. In my case, 25m was the sweet spot. – Henni
I changed the option name. The original link is broken; not sure where to point it to. – Lucid
I've updated the broken link to point to a saved copy on the Wayback Machine. – Sociometry
This seems to avoid the fatal crashes for me, but now I get a "warning: suboptimal pack - out of memory" (git finishes anyway). I should probably try setting the sizes to more than 100m and see if it still finishes. After all, it initially tried to run with 24 threads, so limiting that to 1 should already help a lot... – Septa

Git repack's memory use is: (pack.deltaCacheSize + pack.windowMemory) × pack.threads. Respective defaults are 256MiB, unlimited, nproc.

The delta cache isn't useful: most of the time is spent computing deltas on a sliding window, the majority of which are discarded; caching the survivors so they can be reused once (when writing) won't improve the runtime. That cache also isn't shared between threads.

By default the window memory is limited through pack.window (gc.aggressiveWindow). Limiting packing that way is a bad idea, because the working set size and efficiency will vary widely. It's best to raise both to much higher values and rely on pack.windowMemory to limit the window size.

Finally, threading has the disadvantage of splitting the working set. Lowering pack.threads and increasing pack.windowMemory so that the total stays the same should improve the run time.

repack has other useful tunables (pack.depth, pack.compression, the bitmap options), but they don't affect memory use.
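As an illustrative sketch of that arithmetic, following the advice above (fewer threads, a larger window memory); the numbers assume a ~1 GiB budget and are not a recommendation:

# peak ≈ (pack.deltaCacheSize + pack.windowMemory) × pack.threads
# 1 thread × (256 MiB delta cache + 768 MiB window) ≈ 1 GiB
git config pack.threads "1"
git config pack.deltaCacheSize "256m"
git config pack.windowMemory "768m"
# keep pack.window large and let pack.windowMemory do the actual limiting
git config pack.window "250"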

Tangential answered 8/1, 2015 at 11:50 Comment(1)
This doesn't seem to be the whole story? Do you have an idea: #42175796 – Unionism

You could turn off the delta attribute to disable delta compression for just the blobs of those pathnames:

In foo/.git/info/attributes (or foo.git/info/attributes if it is a bare repository), add entries like the following (see the delta entry in gitattributes, and gitignore for the pattern syntax):

/large_file_dir/* -delta
*.psd -delta
/data/*.iso -delta
/some/big/file -delta
another/file/that/is/large -delta

This will not affect clones of the repository. To affect other repositories (i.e. clones), put the attributes in a .gitattributes file instead of (or in addition to) the info/attributes file.
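To verify that the attribute actually applies to a given path, git check-attr can be used (the path here is just an example):

git check-attr delta -- large_file_dir/some.iso
# expected output: large_file_dir/some.iso: delta: unset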

Rosenzweig answered 22/6, 2010 at 20:0 Comment(1)
This is by far the most helpful answer where large files are concerned. Thanks. I have a repo of some PSDs, and it used to take gigabytes of memory to do a git gc; now it takes under 100MB of RAM. Cool. – Exposed

Git 2.18 (Q2 2018) will improve gc memory consumption.
Before 2.18, "git pack-objects" needed to allocate tons of "struct object_entry" while doing its work: shrinking its size helps performance quite a bit.
This influences git gc.

See commit f6a5576, commit 3b13a5f, commit 0aca34e, commit ac77d0c, commit 27a7d06, commit 660b373, commit 0cb3c14, commit 898eba5, commit 43fa44f, commit 06af3bb, commit b5c0cbd, commit 0c6804a, commit fd9b1ba, commit 8d6ccce, commit 4c2db93 (14 Apr 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit ad635e8, 23 May 2018)

pack-objects: reorder members to shrink struct object_entry

Previous patches leave lots of holes and padding in this struct.
This patch reorders the members and shrinks the struct down to 80 bytes (from 136 bytes on 64-bit systems, before any field shrinking is done) with 16 bits to spare (and a couple more in in_pack_header_size when we really run out of bits).

This is the last in a series of memory reduction patches (see "pack-objects: a bit of document about struct object_entry" for the first one).

Overall they've reduced repack memory size on linux-2.6.git from 3.747G to 3.424G, or by around 320M, a decrease of 8.5%.
The runtime of repack has stayed the same throughout this series.
Ævar's testing on a big monorepo he has access to (bigger than linux-2.6.git) has shown a 7.9% reduction, so the overall expected improvement should be somewhere around 8%.


With Git 2.20 (Q4 2018), it will be easier to check that an object that exists in one fork is not made into a delta against another object that does not appear in the same forked repository.

See commit fe0ac2f, commit 108f530, commit f64ba53 (16 Aug 2018) by Christian Couder (chriscool).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
See commit 9eb0986, commit 16d75fa, commit 28b8a73, commit c8d521f (16 Aug 2018) by Jeff King (peff).
Helped-by: Jeff King (peff), and Duy Nguyen (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit f3504ea, 17 Sep 2018)

pack-objects: move 'layer' into 'struct packing_data'

This reduces the size of 'struct object_entry' from 88 bytes to 80 and therefore makes packing objects more efficient.

For example on a Linux repo with 12M objects, git pack-objects --all needs extra 96MB memory even if the layer feature is not used.


Note that Git 2.21 (Feb. 2019) fixes a small bug: "git pack-objects" incorrectly used uninitialized mutex, which has been corrected.

See commit edb673c, commit 459307b (25 Jan 2019) by Patrick Hogg.
Helped-by: Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit d243a32, 05 Feb 2019)

pack-objects: move read mutex to packing_data struct

ac77d0c ("pack-objects: shrink size field in struct object_entry", 2018-04-14) added an extra usage of read_lock/read_unlock in the newly introduced oe_get_size_slow for thread safety in parallel calls to try_delta().
Unfortunately oe_get_size_slow is also used in serial code, some of which is called before the first invocation of ll_find_deltas.
As such the read mutex is not guaranteed to be initialized.

Resolve this by moving the read mutex to packing_data and initializing it in prepare_packing_data which is initialized in cmd_pack_objects.


Git 2.21 (Feb. 2019) finds yet another way to shrink the size of the pack: "git pack-objects" learns another algorithm to compute the set of objects to send, trading a possibly larger packfile for a cheaper traversal, to favor small pushes.

pack-objects: create pack.useSparse setting

The '--sparse' flag in 'git pack-objects' changes the algorithm used to enumerate objects to one that is faster for individual users pushing new objects that change only a small cone of the working directory.
The sparse algorithm is not recommended for a server, which likely sends new objects that appear across the entire working directory.

Create a 'pack.useSparse' setting that enables this new algorithm.
This allows 'git push' to use this algorithm without passing a '--sparse' flag all the way through four levels of run_command() calls.

If the '--no-sparse' flag is set, then this config setting is overridden.

The config pack documentation now includes:

pack.useSparse:

When true, Git will default to using the '--sparse' option in 'git pack-objects' when the '--revs' option is present.
This algorithm only walks trees that appear in paths that introduce new objects.

This can have significant performance benefits when computing a pack to send a small change.

However, it is possible that extra objects are added to the pack-file if the included commits contain certain types of direct renames.

See "git push is very slow for a huge repo" for a concrete illustration.


Note: as commented in Git 2.24, a setting like pack.useSparse is still experimental.

See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)

repo-settings: create feature.experimental setting

The 'feature.experimental' setting includes config options that are not committed to become defaults, but could use additional testing.

Update the following config settings to take new defaults, and to use the repo_settings struct if not already using them:

  • 'pack.useSparse=true'
  • 'fetch.negotiationAlgorithm=skipping'
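A sketch of opting a single repository into those experimental defaults (explicitly set values still win over the feature flag):

git config feature.experimental true
# at this point in time, roughly equivalent to:
git config pack.useSparse true
git config fetch.negotiationAlgorithm skipping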

With Git 2.26 (Q1 2020), the way "git pack-objects" reuses objects stored in an existing pack to generate its result has been improved.

See commit d2ea031, commit 92fb0db, commit bb514de, commit ff48302, commit e704fc7, commit 2f4af77, commit 8ebf529, commit 59b2829, commit 40d18ff, commit 14fbd26 (18 Dec 2019), and commit 56d9cbe, commit bab28d9 (13 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a14aebe, 14 Feb 2020)

pack-objects: improve partial packfile reuse

Helped-by: Jonathan Tan
Signed-off-by: Jeff King
Signed-off-by: Christian Couder

The old code to reuse deltas from an existing packfile just tried to dump a whole segment of the pack verbatim. That's faster than the traditional way of actually adding objects to the packing list, but it didn't kick in very often. This new code is really going for a middle ground: do some per-object work, but way less than we'd traditionally do.

The general strategy of the new code is to make a bitmap of objects from the packfile we'll include, and then iterate over it, writing out each object exactly as it is in our on-disk pack, but not adding it to our packlist (which costs memory, and increases the search space for deltas).

One complication is that if we're omitting some objects, we can't set a delta against a base that we're not sending. So we have to check each object in try_partial_reuse() to make sure we have its delta.

About performance, in the worst case we might have interleaved objects that we are sending or not sending, and we'd have as many chunks as objects. But in practice we send big chunks.

For instance, packing torvalds/linux on GitHub servers now reused 6.5M objects, but only needed ~50k chunks.


With Git 2.34 (Q4 2021), git repack itself (used by git gc) benefits from reduced memory usage.

See commit b017334, commit a9fd2f2, commit a241878 (29 Aug 2021) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 9559de3, 10 Sep 2021)

builtin/pack-objects.c: remove duplicate hash lookup

Signed-off-by: Taylor Blau

In the original code from 08cdfb1 ("pack-objects --keep-unreachable", 2007-09-16, Git v1.5.4-rc0 -- merge), we add each object to the packing list with type obj->type, where obj comes from lookup_unknown_object().
Unless we had already looked up and parsed the object, this will be OBJ_NONE.
That's fine, since oe_set_type() sets the type_valid bit to '0', and we determine the real type later on.

So the only thing we need from the object lookup is access to the flags field so that we can mark that we've added the object with OBJECT_ADDED to avoid adding it again (we can just pass OBJ_NONE directly instead of grabbing it from the object).

But add_object_entry() already rejects duplicates! This has been the behavior since 7a979d9 ("Thin pack - create packfile with missing delta base.", 2006-02-19, Git v1.3.0-rc1 -- merge), but 08cdfb1 didn't take advantage of it.
Moreover, to do the OBJECT_ADDED check, we have to do a hash lookup in obj_hash.

So we can drop the lookup_unknown_object() call completely, and the OBJECT_ADDED flag, too, since the spot we're touching here is the only location that checks it.

In the end, we perform the same number of hash lookups, but with the added bonus that we don't waste memory allocating an OBJ_NONE object (if we were traversing, we'd need it eventually, but the whole point of this code path is not to traverse).


The interaction between fetch.negotiationAlgorithm and feature.experimental configuration variables has been corrected with Git 2.36 (Q2 2022).

See commit 714edc6, commit a9a136c, commit a68c5b9 (02 Feb 2022) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 70ff41f, 16 Feb 2022)

repo-settings: rename the traditional default fetch.negotiationAlgorithm

Signed-off-by: Elijah Newren

Give the traditional default fetch.negotiationAlgorithm the name 'consecutive'.
Also allow a choice of 'default' to have Git decide between the choices (currently, picking 'skipping' if feature.experimental is true and 'consecutive' otherwise).
Update the documentation accordingly.

git config now includes in its man page:

Control how information about the commits in the local repository is sent when negotiating the contents of the packfile to be sent by the server.

  • Set to "consecutive" to use an algorithm that walks over consecutive commits checking each one.
  • Set to "skipping" to use an algorithm that skips commits in an effort to converge faster, but may result in a larger-than-necessary packfile; or set to "noop" to not send any information at all, which will almost certainly result in a larger-than-necessary packfile, but will skip the negotiation step.
  • Set to "default" to override settings made previously and use the default behaviour.

The default is normally "consecutive", but if feature.experimental is true, then the default is "skipping".
Unknown values will cause 'git fetch' to error out (unknown fetch negotiation algorithm).
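As a sketch, either algorithm can be pinned explicitly, which also overrides whatever feature.experimental would pick:

# converge faster during negotiation, possibly at the cost of a larger pack
git config fetch.negotiationAlgorithm skipping
# or force the traditional behaviour (Git 2.36+)
git config fetch.negotiationAlgorithm consecutive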


With Git 2.44 (Q1 2024), streaming spans of packfile data used to be done only from a single, primary pack in a repository with multiple packfiles.
It has been extended to allow reuse from other packfiles, too. That can influence git gc.

See commit ba47d88, commit af626ac, commit 9410741, commit 3bea0c0, commit 54393e4, commit 519e17f, commit dbd5c52, commit e1bfe30, commit b1e3333, commit ed9f414, commit b96289a, commit ca0fd69, commit 4805125, commit 073b40e, commit d1d701e, commit 5e29c3f, commit 83296d2, commit 35e156b, commit e5d48bf, commit dab6093, commit 307d75b, commit 5f5ccd9, commit fba6818, commit a96015a, commit 6cdb67b, commit 66f0c71 (14 Dec 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 0fea6b7, 12 Jan 2024)

pack-bitmap: enable reuse from all bitmapped packs

Signed-off-by: Taylor Blau

Now that both the pack-bitmap and pack-objects code are prepared to handle marking and using objects from multiple bitmapped packs for verbatim reuse, allow marking objects from all bitmapped packs as eligible for reuse.

Within the reuse_partial_packfile_from_bitmap() function, we no longer only mark the pack whose first object is at bit position zero for reuse, and instead mark any pack contained in the MIDX as a reuse candidate.

Provide a handful of test cases in a new script (t5332) exercising interesting behavior for multi-pack reuse to ensure that we performed all of the previous steps correctly.

git config now includes in its man page:

When true or "single", and when reachability bitmaps are enabled, pack-objects will try to send parts of the bitmapped packfile verbatim.
When "multi", and when a multi-pack reachability bitmap is available, pack-objects will try to send parts of all packs in the MIDX.

If only a single pack bitmap is available, and pack.allowPackReuse is set to "multi", reuse parts of just the bitmapped packfile. This can reduce memory and CPU usage to serve fetches, but might result in sending a slightly larger pack.
Defaults to true.
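On the serving side, a sketch of turning on multi-pack reuse for a repository that already has (or can build) a multi-pack index with a bitmap (Git 2.44+):

git multi-pack-index write --bitmap
git config pack.allowPackReuse multi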


With Git 2.44 (Q1 2024), rc1, setting feature.experimental opts the user into the multi-pack reuse experiment.

See commit 23c1e71, commit 7c01878 (05 Feb 2024) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 3b89ff1, 12 Feb 2024)

pack-objects: enable multi-pack reuse via feature.experimental

Signed-off-by: Taylor Blau

Now that multi-pack reuse is supported, enable it via the feature.experimental configuration in addition to the classic pack.allowPackReuse.

This will allow more users to experiment with the new behavior who might not otherwise be aware of the existing pack.allowPackReuse configuration option.

The enum with values NO_PACK_REUSE, SINGLE_PACK_REUSE, and MULTI_PACK_REUSE is defined statically in builtin/pack-objects.c's compilation unit.
We could hoist that enum into a scope visible from the repository_settings struct, and then use that enum value in pack-objects.
Instead, define a single int that indicates what pack-objects's default value should be to avoid additional unnecessary code movement.

Though feature.experimental implies pack.allowPackReuse=multi, this can still be overridden by explicitly setting the latter configuration to either "single" or "false".

git config now includes in its man page:

  • pack.allowPackReuse=multi may improve the time it takes to create a pack by reusing objects from multiple packs instead of just one.
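So, as a sketch, an administrator who wants the other experimental defaults but not multi-pack reuse can combine the two settings:

git config feature.experimental true
# explicitly keep single-pack reuse despite feature.experimental
git config pack.allowPackReuse single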
Lorileelorilyn answered 25/5, 2018 at 23:11 Comment(0)