Why is my git repository so big?

145M = .git/objects/pack/

I wrote a script that adds up the sizes of the diffs between each commit and the commit before it, walking backwards from the tip of each branch. I get 129 MB, and that is without compression and without accounting for identical files shared across branches or for history common to several branches.
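
For illustration, a rough sketch of the kind of measurement described above (this is not the original script; it just sums the byte size of each commit's diff against its first parent):

# Rough sketch of the measurement described above (not the original script):
# sum the uncompressed diff size of each commit against its first parent.
git rev-list --all --no-merges | while read commit; do
    git diff "$commit^" "$commit" 2>/dev/null | wc -c
done | awk '{ total += $1 } END { printf "%.1f MB\n", total / 1024 / 1024 }'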

Git takes all of those things into account, so I would expect a much, much smaller repository. So why is .git so big?

I've done:

git fsck --full
git gc --prune=today --aggressive
git repack

To answer the question about how many files/commits: I have 19 branches with about 40 files in each, and 287 commits, found using:

git log --oneline --all|wc -l

It should not take tens of megabytes to store information about this.

Sialagogue answered 22/6, 2009 at 23:52 Comment(5)
Linus recommends the following over aggressive gc. Does it make a significant difference? git repack -a -d --depth=250 --window=250 – Teage
thanks gbacon, but no difference. – Sialagogue
That's because you are missing the -f. metalinguist.wordpress.com/2007/12/06/… – Chard
git repack -a -d shrunk my 956MB repo to 250MB. Great success! Thanks! – Ulrick
One caveat I found: if you have git submodules, the .git repos of the submodules show up in the super module's .git directory, so du may be misleading about the super module being large when it is in fact a submodule, and the answers below need to be run in the submodule directory. – Thompkins
72

I recently pulled the wrong remote repository into the local one (git remote add ... and git remote update). After deleting the unwanted remote ref, branches and tags I still had 1.4GB (!) of wasted space in my repository. I was only able to get rid of this by cloning it with git clone file:///path/to/repository. Note that the file:// makes a world of difference when cloning a local repository - only the referenced objects are copied across, not the whole directory structure.

Edit: here's Ian's one-liner for recreating all the branches in the new repo:

d1=/path/to/original/repo    # original repo
d2=/path/to/new/repo         # new repo (must already exist)
cd "$d1"
for b in $(git branch | cut -c 3-)
do
    git checkout "$b"
    x=$(git rev-parse HEAD)
    cd "$d2"
    git checkout -b "$b" "$x"
    cd "$d1"
done
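
For completeness, a rough sketch of the clone step that precedes the loop above (the paths are placeholders, reusing the same d1/d2 variables):

# Clone over file:// so only reachable objects are copied, not the whole
# .git directory, then compare object counts with the original repository.
git clone file://"$d1" "$d2"
git -C "$d2" count-objects -v    # compare size-pack with the original repo
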
Comprehensible answered 24/6, 2009 at 4:40 Comment(8)
Wow. THANK YOU. .git = 15M now!! After cloning, here is a little one-liner for preserving your previous branches: d1=#original repo; d2=#new repo; cd $d1; for b in $(git branch | cut -c 3-); do git checkout $b; x=$(git rev-parse HEAD); cd $d2; git checkout -b $b $x; cd $d1; done – Sialagogue
If you check this, you could add the one-liner to your answer so it's formatted as code. – Sialagogue
I foolishly added a bunch of video files to my repo, and had to reset --soft HEAD^ and recommit. The .git/objects dir was huge after that, and this was the only way that got it back down. However, I didn't like the way the one-liner changed my branch names around (it showed origin/branchname instead of just branchname). So I went a step further and performed some sketchy surgery: I deleted the .git/objects directory from the original and put in the one from the clone. That did the trick, leaving all of the original branches, refs, etc. intact, and everything seems to work (crossing fingers). – Pyonephritis
Thanks for the tip about the file:// clone, that did the trick for me. – Donohoe
Be careful: git just links to the original when cloning locally (to save space, why have the same stuff twice?). Yes, you get a small clone; no, you cannot delete the original, as that would break the clone. – Teraterai
@Teraterai if you hard-link to a file and delete the original file, nothing happens except that a reference counter gets decremented from 2 to 1. Only when that counter reaches 0 is the space freed for other files on the filesystem. So no, even if the files were hard-linked, nothing would happen if the original got deleted. – Pepe
@IanKelling please add that the new repo dir should already exist. I just messed up my repo because directory #2 didn't exist... – Pliers
OMGolly! Not sure why this worked but this is fantastic. – Vikiviking
186

Some scripts I use:

git-fatfiles

git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
    cut -f1 -d' ' | \
    git cat-file --batch-check | \
    grep blob | \
    sort -n -k 3 | \
    tail -n40 | \
    while read hash type size; do 
         echo -n "-e s/$hash/$size/p ";
    done) | \
    sort -n -k1
...
89076 images/screenshots/properties.png
103472 images/screenshots/signals.png
9434202 video/parasite-intro.avi

If you want more lines, see also the Perl version in a neighbouring answer: https://mcmap.net/q/12766/-why-is-my-git-repository-so-big

git-eradicate (for video/parasite.avi):

git filter-branch -f  --index-filter \
    'git rm --force --cached --ignore-unmatch video/parasite-intro.avi' \
     -- --all
rm -Rf .git/refs/original && \
    git reflog expire --expire=now --all && \
    git gc --aggressive && \
    git prune

Note: the second script is designed to remove info from Git completely (including all info from reflogs). Use with caution.

Voncile answered 15/1, 2013 at 1:52 Comment(11)
Finally... Ironically, I saw this answer earlier in my search but it looked too complicated... after trying other things, this one started to make sense and voila! – Lamar
@msanteler, the former (git-fatfiles) script emerged when I asked the question on IRC (Freenode/#git). I saved the best version to a file, then posted it as an answer here. (I can't find the original author in the IRC logs, though.) – Voncile
This works very well initially. But when I fetch or pull from the remote again, it just copies all the big files back into the archive. How do I prevent that? – Loculus
@felbo, then the problem is probably not just in your local repository, but in other repositories as well. Maybe you need to do the procedure everywhere, or force everybody to abandon the original branches and switch to the rewritten ones. It is not easy in a big team and needs cooperation between developers and/or manager intervention. Sometimes just leaving the lodestone inside can be the better option. – Voncile
This function is great, but it's unimaginably slow. It can't even finish on my computer if I remove the 40-line limit. FYI, I just added an answer with a more efficient version of this function. Check it out if you want to use this logic on a big repository, or if you want to see the sizes summed per file or per folder. – Quixote
I committed a 10 MB image, noticed the mess, resized it to 100 KB and committed again with the same name. Your script for listing fat files now lists two files with the same name. When using filter-branch, how does it know which one to delete? – Therewith
@yellow01, you'll need a more advanced solution. Or run filter-branch starting from the commit where you had the image removed (then rebase the rest on top of it). – Voncile
How do I use that script/command? If that's a terminal command, it did nothing in my case. – Xeniaxeno
The fastest (and easiest) way to clean up a bloated Git history is to use the BFG (rtyley.github.io/bfg-repo-cleaner). – Scheider
This worked for me. @Scheider, thanks for the link to the BFG as well. – Digiacomo
How do I execute that script? – Wifely
69

git gc already does a git repack, so there is no sense in repacking manually unless you are going to pass some special options to it.

The first step is to see whether the majority of the space is (as would normally be the case) in your object database.

git count-objects -v

This should give a report of how many unpacked objects there are in your repository, how much space they take up, how many pack files you have and how much space they take up.
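
The output looks roughly like this; the field names come from git count-objects, but the numbers and the trailing annotations are only illustrative:

count: 4             # loose (unpacked) objects
size: 48             # disk space used by loose objects, in KiB
in-pack: 1570        # number of objects stored in pack files
packs: 1             # number of pack files
size-pack: 148432    # disk space used by the packs, in KiB
prune-packable: 0    # loose objects that also exist in a pack
garbage: 0
size-garbage: 0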

Ideally, after a repack, you would have no unpacked objects and one pack file, but it's perfectly normal to have some objects which aren't directly referenced by current branches still present and unpacked.

If you have a single large pack and you want to know what is taking up the space then you can list the objects which make up the pack along with how they are stored.

git verify-pack -v .git/objects/pack/pack-*.idx

Note that verify-pack takes an index file and not the pack file itself. This gives a report of every object in the pack, its true size and its packed size, as well as information about whether it has been 'deltified' and, if so, the origin of its delta chain.

To see if there are any unusually large objects in your repository, you can sort the output numerically on the third or fourth column (e.g. | sort -k3n).

From this output you will be able to see the contents of any object using the git show command, although it is not possible to see exactly where in the commit history of the repository the object is referenced. If you need to do this, try something from this question.
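
Putting the two steps together, a rough sketch (not from the original answer) that lists the ten largest blobs and maps each hash back to a path; /tmp/objects.txt is just a scratch file:

# List the ten largest blobs (field 3 of the verify-pack output is the object
# size), then map each SHA-1 back to a path via rev-list's object listing.
git rev-list --all --objects > /tmp/objects.txt
git verify-pack -v .git/objects/pack/pack-*.idx \
    | awk '$2 == "blob" { print $3, $1 }' \
    | sort -n | tail -10 \
    | while read size sha; do
          path=$(grep "$sha" /tmp/objects.txt | cut -d' ' -f2-)
          echo "$size $path"
      done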

Neron answered 24/6, 2009 at 5:54 Comment(2)
This found the big objects, great. The accepted answer got rid of them. – Sialagogue
The difference between git gc and git repack, according to Linus Torvalds: metalinguist.wordpress.com/2007/12/06/… – Chard
42

Just FYI, the biggest reason why you may end up with unwanted objects being kept around is that git maintains a reflog.

The reflog is there to save your butt when you accidentally delete your master branch or somehow otherwise catastrophically damage your repository.

The easiest way to fix this is to truncate your reflogs before compressing (just make sure that you never want to go back to any of the commits in the reflog).

git gc --prune=now --aggressive
git repack

This is different from git gc --prune=today in that it expires the entire reflog immediately.
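
If you prefer to expire the reflog entries explicitly rather than relying on gc's expiry settings, a common sequence (same caveat: you lose the safety net) looks something like this:

# Expire every reflog entry immediately, then prune unreachable objects.
# Only do this if you are certain you will never need the reflog to recover anything.
git reflog expire --expire=now --expire-unreachable=now --all
git gc --prune=now --aggressive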

Shekinah answered 6/1, 2013 at 19:53 Comment(2)
This one did it for me! I went from about 5 GB to 32 MB. – Otalgia
This answer seemed easier to do but unfortunately did not work for me. In my case I was working on a just-cloned repository. Is that the reason? – Peradventure
18

If you want to find what files are taking up space in your git repository, run

git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5

Then, extract the blob reference that takes up the most space (the last line), and check which filename is taking up so much space:

git rev-list --objects --all | grep <reference>

This might even be a file that you removed with git rm, but git remembers it because there are still references to it, such as tags, remotes and reflog.
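
If you also want to see which commits touched that blob, recent versions of Git can search for the object directly (a small sketch; <reference> is the blob hash found above):

# Show the commits, across all refs, in which that blob was added or removed.
git log --all --oneline --find-object=<reference>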

Once you know what file you want to get rid of, I recommend using git forget-blob

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

It is easy to use; just run:

git forget-blob file-to-forget

This will remove every reference from git, remove the blob from every commit in history, and run garbage collection to free up the space.

Scythia answered 23/1, 2017 at 12:50 Comment(0)
8

The git-fatfiles script from Vi's answer is lovely if you want to see the size of all your blobs, but it's so slow as to be unusable. I removed the 40-line output limit, and it tried to use all my computer's RAM instead of finishing. Plus it would give inaccurate results when summing the output to see all space used by a file.

I rewrote it in Rust, which I find to be less error-prone than other languages. I also added the feature of summing up the space used by all commits in various directories if the --directories flag is passed. Paths can be given to limit the search to certain files or directories.

src/main.rs:

use std::{
    collections::HashMap,
    io::{self, BufRead, BufReader, Write},
    path::{Path, PathBuf},
    process::{Command, Stdio},
    thread,
};

use bytesize::ByteSize;
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt()]
pub struct Opt {
    #[structopt(
        short,
        long,
        help("Show the size of directories based on files committed in them.")
    )]
    pub directories: bool,

    #[structopt(help("Optional: only show the size info about certain paths."))]
    pub paths: Vec<String>,
}

/// The paths list is a filter. If empty, there is no filtering.
/// Returns a map of object ID -> filename.
fn get_revs_for_paths(paths: Vec<String>) -> HashMap<String, PathBuf> {
    let mut process = Command::new("git");
    let mut process = process.arg("rev-list").arg("--all").arg("--objects");

    if !paths.is_empty() {
        process = process.arg("--").args(paths);
    };

    let output = process
        .output()
        .expect("Failed to execute command git rev-list.");

    let mut id_map = HashMap::new();
    for line in io::Cursor::new(output.stdout).lines() {
        if let Some((k, v)) = line
            .expect("Failed to get line from git command output.")
            .split_once(' ')
        {
            id_map.insert(k.to_owned(), PathBuf::from(v));
        }
    }
    id_map
}

/// Returns a map of object ID to size.
fn get_sizes_of_objects(ids: Vec<&String>) -> HashMap<String, u64> {
    let mut process = Command::new("git")
        .arg("cat-file")
        .arg("--batch-check=%(objectname) %(objecttype) %(objectsize:disk)")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("Failed to execute command git cat-file.");
    let mut stdin = process.stdin.expect("Could not open child stdin.");

    let ids: Vec<String> = ids.into_iter().cloned().collect(); // copy data for thread

    // Stdin will block when the output buffer gets full, so it needs to be written
    // in a thread:
    let write_thread = thread::spawn(|| {
        for obj_id in ids {
            writeln!(stdin, "{}", obj_id).expect("Could not write to child stdin");
        }
        drop(stdin);
    });

    let output = process
        .stdout
        .take()
        .expect("Could not get output of command git cat-file.");

    let mut id_map = HashMap::new();
    for line in BufReader::new(output).lines() {
        let line = line.expect("Failed to get line from git command output.");

        let line_split: Vec<&str> = line.split(' ').collect();

        // skip non-blob objects
        if let [id, "blob", size] = &line_split[..] {
            id_map.insert(
                id.to_string(),
                size.parse::<u64>().expect("Could not convert size to int."),
            );
        };
    }
    write_thread.join().unwrap();
    id_map
}

fn main() {
    let opt = Opt::from_args();

    let revs = get_revs_for_paths(opt.paths);
    let sizes = get_sizes_of_objects(revs.keys().collect());

    // This skips directories (they have no size mapping).
    // Filename -> size mapping tuples. Files are present in the list more than once.
    let file_sizes: Vec<(&Path, u64)> = sizes
        .iter()
        .map(|(id, size)| (revs[id].as_path(), *size))
        .collect();

    // (Filename, size) tuples.
    let mut file_size_sums: HashMap<&Path, u64> = HashMap::new();
    for (mut path, size) in file_sizes.into_iter() {
        if opt.directories {
            // For file path "foo/bar", add these bytes to path "foo/"
            let parent = path.parent();
            path = match parent {
                Some(parent) => parent,
                _ => {
                    eprint!("File has no parent directory: {}", path.display());
                    continue;
                }
            };
        }

        *(file_size_sums.entry(path).or_default()) += size;
    }
    let sizes: Vec<(&Path, u64)> = file_size_sums.into_iter().collect();

    print_sizes(sizes);
}

fn print_sizes(mut sizes: Vec<(&Path, u64)>) {
    sizes.sort_by_key(|(_path, size)| *size);
    for file_size in sizes.iter() {
        // The size needs some padding--a long size is as long as a tabstop
        println!("{:10}{}", ByteSize(file_size.1), file_size.0.display())
    }
}

Cargo.toml:

[package]
name = "git-fatfiles"
version = "0.1.0"
edition = "2018"
[dependencies]
structopt = { version = "0.3"}
bytesize = {version = "1"}

Options:

USAGE:
    git-fatfiles [FLAGS] [paths]...

FLAGS:
    -d, --directories    Show the size of directories based on files committed in them.
    -h, --help           Prints help information

ARGS:
    <paths>...    Optional: only show the size info about certain paths.
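
A typical invocation from the project directory might look like this (the src/ path is just an example):

# Build and run with Cargo; sum sizes per directory, limited to the src/ tree.
cargo run --release -- --directories src/
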
Quixote answered 28/7, 2017 at 6:8 Comment(6)
Heads-up that this doesn't handle paths with spaces correctly. You can see my fix here: github.com/truist/settings/commit/… – Iago
@NathanArthur Thanks for the info! I just rewrote the script in Rust and linked to your GitHub as the original version. Let me know if you prefer I don't link to it in the answer. While I was working, I also noticed %fileSizes should not be a hash, since filenames appear more than once in the data. The Rust version is fixed, but I'm not sure what the semantics of the Perl version should be when a file appears in the data more than once. I made --sum not optional, which clears up the semantics. – Quixote
I spent some time looking at the problem with %fileSizes and I don't agree that it's wrong. Your new implementation (and the old one, with --sum) will tell you the cumulative size used by a file throughout its history. But that might obscure the giant files that might be in the history somewhere; a frequently changed small file might have a huge cumulative size. Both versions are useful. In my local example, the worst single file is the 20th-largest file (cumulatively), and the other 19 are just source files with lots of changes. – Iago
(Also, FWIW, a Perl script is much easier to copy and run than a Rust script. I had to install Rust and learn many things about Rust package management just to run this.) – Iago
@NathanArthur In the original version of this code (yours as well), the script gives a different result every time it runs. Using a hash for %fileSizes seems okay only if every iteration does a comparison and only updates if the new size is larger. And sorry about the inconvenience of installing Rust; it's the trade-off I chose to reduce bugs and improve readability. At least it's easier than .NET or Java project setup. I'll make the project file naming more explicit. – Quixote
I started investigating this and went down a rabbit hole of bug discovery. You're right about %fileSizes, and it breaks --sum (and --directories) entirely. I rewrote the script from scratch and described my findings in a new answer. The new script is at the same URL as the old one. – Iago
4

Are you sure you are counting just the .pack files and not the .idx files? They are in the same directory as the .pack files, but do not have any of the repository data (as the extension indicates, they are nothing more than indexes for the corresponding pack — in fact, if you know the correct command, you can easily recreate them from the pack file, and git itself does it when cloning, as only a pack file is transferred using the native git protocol).

As a representative sample, I took a look at my local clone of the linux-2.6 repository:

$ du -c *.pack
505888  total

$ du -c *.idx
34300   total

This indicates that an expansion of around 7% should be expected.

There are also files outside objects/; in my personal experience, index and gitk.cache tend to be the biggest of them (totaling 11M in my clone of the linux-2.6 repository).
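
If you want to check how that breaks down in your own repository, a quick sketch (gitk.cache only exists if you have used gitk):

# Rough breakdown of where the space inside .git goes.
du -sch .git/objects/pack/*.pack .git/objects/pack/*.idx
du -sh .git/index .git/gitk.cache 2>/dev/null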

Treenware answered 23/6, 2009 at 1:55 Comment(0)
3

Other git objects stored in .git include trees, commits, and tags. Commits and tags are small, but trees can get big, particularly if you have a very large number of small files in your repository. How many files and how many commits do you have?
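
For reference, one quick way to get both numbers (a minimal sketch):

# Commits reachable from any ref, and files tracked in the current checkout.
git rev-list --all --count
git ls-files | wc -l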

Alyose answered 23/6, 2009 at 0:39 Comment(2)
Good question. 19 branches with about 40 files in each. git count-objects -v says "in-pack: 1570". Not sure exactly what that means or how to count how many commits I have. A few hundred, I'd guess. – Sialagogue
OK, it doesn't sound like that is the answer then. A few hundred will be insignificant compared to 145 MB. – Alyose
2

Did you try using git repack?

Jayjaycee answered 23/6, 2009 at 0:21 Comment(2)
Good question. I did; I also got the impression that git gc does that too? – Sialagogue
It does, with git gc --auto. Not sure about what you used. – Jayjaycee
2

Before running git filter-branch and git gc, you should review the tags that are present in your repo. Any real system which has automatic tagging for things like continuous integration and deployments will leave unwanted objects still referenced by these tags; hence gc can't remove them, and you will keep wondering why the repo is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repo. The new bare repo will have the cleaned-up tree.
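
A rough sketch of that workflow; the tag names, remote name and path below are placeholders:

# 1. Review tags and delete any that pin unwanted history (example tag name).
git tag -l
git tag -d old-ci-build-123
git push origin :refs/tags/old-ci-build-123    # also remove it from the remote

# 2. After git filter-branch and git gc, push the cleaned branch to a fresh bare repo.
git init --bare /path/to/clean.git
git push /path/to/clean.git master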

Obadiah answered 9/9, 2011 at 1:54 Comment(0)
1

This can happen if you accidentally added a big chunk of files and staged them, without necessarily committing them. For example, in a Rails app you run bundle install --deployment and then accidentally run git add .; you notice all the files added under vendor/bundle and unstage them, but they have already made it into the git history. To fix it, apply Vi's answer, replacing video/parasite-intro.avi with vendor/bundle, and then run the second command he provides.

You can see the difference with git count-objects -v: in my case, before applying the script, size-pack was 52K; afterwards it was 3.8K.
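
Concretely, adapting the commands from Vi's answer to this case might look like the following sketch (note the extra -r, since vendor/bundle is a directory):

# Rewrite all refs to drop vendor/bundle from history, then expire and prune.
git filter-branch -f --index-filter \
    'git rm -r --force --cached --ignore-unmatch vendor/bundle' \
    -- --all
rm -Rf .git/refs/original && \
    git reflog expire --expire=now --all && \
    git gc --aggressive && \
    git prune
echo 'vendor/bundle' >> .gitignore    # keep it from being added again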

Rupe answered 8/11, 2016 at 0:28 Comment(0)
1

It is worth checking stacktrace.log. It is basically an error log for tracing failed commits. I recently found out that my stacktrace.log was 65.5 GB while my app was 66.7 GB.

Azotic answered 9/4, 2018 at 6:47 Comment(0)
1

I've created a new implementation of the Perl script that was originally provided in this answer (which has since been rewritten in Rust). After much investigation of that Perl script, I realized that it had multiple bugs:

  • Errors with paths with spaces
  • --sum didn't work correctly (it wasn't actually adding up all the deltas)
  • --directory didn't work correctly (it relies on --sum)
  • Without --sum, it would report the size of an effectively random object for the given path, which might not have been the largest one

So I ended up rewriting the script entirely. It uses the same sequence of git commands (git rev-list and git cat-file) but then it processes the data correctly to give accurate results. I preserved the --sum and --directories features.

I also changed it to report the "disk" size (i.e. the compressed size in the git repo) of the files, rather than the original file sizes. That seems more relevant to the problem at hand. (This could be made optional, if someone wants the uncompressed sizes for some reason.)

I also added an option to only report on files that have been deleted, on the assumption that files still in use are probably less interesting. (The way I did that was a bit of a hack; suggestions welcome.)

The latest script is here. I can also copy it here if that's good StackOverflow etiquette? (It's ~180 lines long.)

Iago answered 7/9, 2021 at 17:54 Comment(2)
Nice. This is much more readable than my original script. I borrowed your technique of using %(objectsize:disk). – Quixote
Yes, it is definitely considered good etiquette to include the script. I thought it was actually some sort of rule, but I'm having some difficulty finding the site rules in the help center... – Ultrasonic
-1

Create a new branch where the current commit is the initial commit, with all history gone, to reduce the number of git objects and the overall history size.

Note: Please read the comment before running the code.

  1. git checkout --orphan latest_branch
  2. git add -A
  3. git commit -a -m "Initial commit message"   # commit the changes
  4. git branch -D master                        # delete the master branch
  5. git branch -m master                        # rename the current branch to master
  6. git push -f origin master                   # force-push to the master branch
  7. git gc --aggressive --prune=all             # remove the old files
Assiut answered 19/5, 2021 at 4:13 Comment(0)
