Find files in git repo over x megabytes, that don't exist in HEAD

I have a Git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.

There are some large binary files I have deleted over time (generally 1-5MB) which are sitting around, increasing the size of the repository, and which I don't need in the revision history.

Basically I want to be able to do..

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

..then be able to go through each result, checking whether it's no longer required, and then removing it (probably using filter-branch)
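
For the removal step I assume something like the usual index-filter invocation would do, substituting in whichever path the script reports (using example/blah.psd from the output above):

git filter-branch --index-filter 'git rm --cached --ignore-unmatch example/blah.psd' --prune-empty -- --all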

Hidebound answered 18/11, 2008 at 10:0 Comment(0)

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

# convert the b/k/m suffix into a byte cutoff
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;

# recursively collect [ size, path components... ] for every blob under $tree
# whose size is at least $cutoff
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

# the same tree SHA recurs across many commits, so cache walk_tree per tree
memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
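
A sample invocation, assuming the script is saved as git-large-blob and is executable; per the usage line, the size takes an optional b/k/m suffix and any further arguments are passed straight to git log:

git-large-blob 1m --all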
Dispensable answered 18/11, 2008 at 14:32 Comment(6)
I'm having difficulties understanding this code. Any examples of how to use your nice command? – Alcaraz
aha. no arguments. it just took some time for it to output anything to the screen. git-large-blob 500k – Alcaraz
I recommend mislav's answer over this one. It gave me more accurate answers. – Jablon
Doesn't work on Windows :-( "List form of pipe open not implemented" – Bistort
I recommend passing --reverse to the script, so the listed commit IDs and times will actually correspond to when the big file was first introduced. – Fanchette
The memoisation of the tree walk is what makes this a great script; performance-wise, however, it's still held back by a couple of things: the fork-and-exec-ing necessary to invoke the git executables, and (assuming I'm reading the Perl correctly) the non-parallel execution. For comparison, executing against a medium-sized repo (JGit, ~2400 commits), "git-large-blob.pl 128k" takes ~1.5 minutes on my box, whereas The BFG (...my answer) takes only 4.8 seconds to run "bfg --strip-blobs-bigger-than 128K" (finds all large blobs not in your latest commit, eradicates them, and updates all refs). – Stephanystephen

A more compact Ruby script:

#!/usr/bin/env ruby -w
head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)
Impulsive answered 30/10, 2011 at 13:35 Comment(3)
This is a great answer but it does have one flaw. The large objects are stored in the hash big_files which uses sha as the unique key. In theory this is fine - each object blob is unique after all. However, in practise it is conceivable that you have exactly the same file in multiple locations in your repository. For example, this could be a test file which requires different filenames but not different physical content. Problems arise when you see a large object with a path that you do not need but unbeknownst to you, this same file exists somewhere else where it is needed. – Spend
@Spend I added this line right before big_files[sha] = ... so I could at least know when that happens: warn "Another path for #{sha} is #{path}" if big_files.has_key? sha and big_files[sha][0] != path – Pagandom
I've modified this script to make it more suitable for large repositories: do not process more than 1000 commits, show some output in console while working, avoid an error in git show command for files not in the working tree: gist.github.com/victor-homyakov/690cd2991c77539ca4fe – Nivernais

Python script to do the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if (len(sys.argv) <> 2):
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            # ls-tree -l fields: mode, type, sha, size ('-' for non-blobs), path
            splitdata = file.split()
            commit = splitdata[2]   # this is actually the blob's SHA, not a commit
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if (size > maxSize):
                bigfiles.add("%10d %s %s" % (size, commit, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print f
Kyat answered 18/11, 2008 at 10:0 Comment(4)
For bigfiles, it's better to just do bigfiles = sorted(set(bigfiles), reverse=True). Or, better yet, start it with bigfiles = set() and use bigfiles.add instead of bigfiles.append. – Istic
Yes, but now there's no need to set it again! :) – Istic
Worked great and I know python so this was my preference. – Misjudge
Pathnames with spaces are not parsed correctly with this script. Changing the path = splitdata[4] line to path = ' '.join(splitdata[4:]) did it for me. Thanks for your script, super handy and readable! – Bayberry

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data
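
For the second use-case the matching option is --replace-text, which takes a file listing the strings to redact (banned.txt below is just an example name, not part of the original answer):

$ java -jar bfg.jar  --replace-text banned.txt  my-repo.git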

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Stephanystephen answered 5/4, 2013 at 20:11 Comment(2)
@quetzalcoatl yup, the BFG has three namesakes: 1) the BFG weapon, 2) the Big Friendly Giant, and 3) ...it's the initials of git-filter-branch, written backwards :-) Believe me, The BFG is a splendid weapon for the destruction of data! – Stephanystephen
BFG is great for actually deleting the files, but how to find out which files will be deleted? – Pharisee

Ouch... that first script (by Aristotle) is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.

It also appears to print several wrong SHAs -- often a SHA is printed that has nothing to do with the filename mentioned on the next line.

Here's a faster version. The output format is different, but it is very fast, and it is also -- as far as I can tell -- correct.

The program is a bit longer but a lot of it is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
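
A sample run, going by the argument handling above (minimum size in bytes, then optional refs, defaulting to HEAD); the script name is a placeholder:

./git-large-files.pl 102400 master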
Heterogeneous answered 13/1, 2012 at 17:6 Comment(0)

Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.

By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
     # all deletions up to 1 minute ago become available to be garbage-collected
$ git fsck --unreachable
     # lists all the blobs (file contents) that will be garbage-collected
$ git prune
$ git gc

A side comment: While I am a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes in binaries, so any revision stores another full copy in the repo.

Of course this use of Git is a perfectly good way to get familiar with how it works.

Countrywide answered 18/11, 2008 at 22:41 Comment(3)
There's no advantage to using git like this, but it handles it fine, and using a different VCS just because it handles binary files (or random bunches of files) better would be inconvenient (convenience being the only reason I keep the directory in git!) – Hidebound
Git stores «another full copy» of any file; there is no difference whether it is a text file or a binary one! Though it cannot show you changes in a binary file. – Jewel
It's worth noting that Git actually /does/ perform delta-compression on its packfiles (https://mcmap.net/q/12449/-is-the-git-binary-diff-algorithm-delta-storage-standardized/438886) and so you don't necessarily pay for the storage of a modified binary file twice if it was a simple modification. However, I'd agree with @Countrywide that storing large files in Git is generally not a great idea. Git-annex (git-annex.branchable.com) might be a good way to combine the two if you really have to - I haven't used it myself. – Stephanystephen
A bash script that does a similar rev-list / ls-tree walk:

#!/bin/bash
if [ "$#" != 1 ]
then
  echo 'git large.sh [size]'
  exit
fi

declare -A big_files
big_files=()
echo printing results

while read commit
do
  while read bits type sha size path
  do
    if [ "$size" -gt "$1" ]
    then
      big_files[$sha]="$sha $size $path"
    fi
  done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)

for file in "${big_files[@]}"
do
  read sha size path <<< "$file"
  if git ls-tree -r HEAD | grep -q $sha
  then
    echo $file
  fi
done

Source

Myrmecophagous answered 16/5, 2012 at 19:14 Comment(1)
Nice and clean! (I even learnt something about bash!) Unfortunately, it is too slow for me (just too many objects in the repo). So I ended up using a Windows folder sizing tool (Explorer++) to find the largest folder, then object, in .git, followed by a git rev-list --objects --all | grep <sha1>. It's not very fancy, but it worked for me. – Mercury

This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD, sorted from smallest to largest.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vFf <(git ls-tree -r HEAD | awk '{print $3}') \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This will generate output like this:

2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

For more information, including an output format more suitable for further script processing, see my original answer on a similar question.

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.

Denni answered 6/9, 2017 at 21:41 Comment(2)
great, but fails with zsh: argument list too long: grep – Essieessinger
@IlyaSheershoff I have updated the answer to avoid hitting the argument limit. Thanks for bringing this up. – Denni

My Python simplification of https://mcmap.net/q/13616/-find-files-in-git-repo-over-x-megabytes-that-don-39-t-exist-in-head:

#!/usr/bin/env python
import os, sys

bigfiles = []
for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision).read().split('\0'):
        if f:
            mode, type, commit, size, path = f.split(None, 4)
            if int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), commit, path))

for f in sorted(set(bigfiles)):
    print f
Predesignate answered 27/4, 2012 at 0:35 Comment(0)

A little late to the party, but git-fat has this functionality built in.

Just install it with pip and run git fat -a find 100000, where the number at the end is in bytes.

Markova answered 11/9, 2014 at 19:53 Comment(1)
It is now unmaintained, 10 years old! – Dogy

As of 2023, the easiest answer to this question is git-filter-repo. It's a single-file script that you can download here. Put it anywhere and run it in your repo with the --analyze argument. That will create a file .git/filter-repo/analysis/path-deleted-sizes.txt in your repo that contains exactly the information you want.
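
For example, assuming the script sits on your PATH under the name git-filter-repo (so Git picks it up as a subcommand):

$ git filter-repo --analyze
$ less .git/filter-repo/analysis/path-deleted-sizes.txt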

Luciano answered 28/9, 2023 at 9:55 Comment(0)
