Git disk usage per branch
G

6

14

Is there a way to list the disk usage of a git repository per branch, the way df or du would?

By "the space usage" of a branch I mean the space used by the commits that are not shared with any other branch of the repository.

Gers answered 5/12, 2012 at 11:17 Comment(5)
Most file content will be present on more than one branch, so it's unlikely you can get something meaningful.Haircloth
I don't really understand your comment... I want something that would tell me how much space each of my git branches is using...Gers
Your git branches don't take space. That is, if you remove one of your branches you usually don't remove much content (even without taking compression into account). And the size of the repository can't be thought of as the sum of the sizes of the branches.Haircloth
I think that there is something to do with git verify-pack -v and some script... see git-scm.com/book/ch9-4.htmlGers
Aaah I finally did it with that command ! :)Gers
G
6

As nothing like that seems to exist already, here is a Ruby script I wrote for it.

#!/usr/bin/env ruby -w
require 'set'

display_branches = ARGV

packed_blobs = {}

class PackedBlob
    attr_accessor :sha, :type, :size, :packed_size, :offset, :depth, :base_sha, :is_shared, :branch
    def initialize(sha, type, size, packed_size, offset, depth, base_sha)
        @sha = sha
        @type = type
        @size = size
        @packed_size = packed_size
        @offset = offset
        @depth = depth
        @base_sha = base_sha
        @is_shared = false
        @branch = nil
    end
end

class Branch
    attr_accessor :name, :blobs, :non_shared_size, :non_shared_packed_size, :shared_size, :shared_packed_size, :non_shared_dependable_size, :non_shared_dependable_packed_size
    def initialize(name)
        @name = name
        @blobs = Set.new
        @non_shared_size = 0
        @non_shared_packed_size = 0
        @shared_size = 0
        @shared_packed_size = 0
        @non_shared_dependable_size = 0
        @non_shared_dependable_packed_size = 0
    end
end

dependable_blob_shas = Set.new

# Collect every packed blobs information
for pack_idx in Dir[".git/objects/pack/pack-*.idx"]
    IO.popen("git verify-pack -v #{pack_idx}", 'r') do |pack_list|
        pack_list.each_line do |pack_line|
            pack_line.chomp!
            if not pack_line.include? "delta"
                sha, type, size, packed_size, offset, depth, base_sha = pack_line.split(/\s+/, 7)
                size = size.to_i
                packed_size = packed_size.to_i
                packed_blobs[sha] = PackedBlob.new(sha, type, size, packed_size, offset, depth, base_sha)
                dependable_blob_shas.add(base_sha) if base_sha != nil
            else
                break
            end
        end
    end
end

branches = {}

# Now check the blobs of every branch to determine whether each one is shared between branches
IO.popen(["git", "for-each-ref", "--format=%(refname:short)", "refs/heads"], 'r') do |branch_list|
    branch_list.each_line do |branch_line|
        # For each local branch (for-each-ref never prints "(HEAD detached at ...)")
        branch_name = branch_line.chomp
        branch = Branch.new(branch_name)
        branches[branch_name] = branch
        # The array form of popen bypasses the shell, so branch names need no quoting
        IO.popen(["git", "rev-list", branch_name], 'r') do |rev_list|
            rev_list.each_line do |commit|
                # Look into each commit in order to collect all the blobs used
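                for object in IO.popen(["git", "ls-tree", "-zrl", commit.chomp], &:read).split("\0")
                    _bits, type, sha, _size, _path = object.split(/\s+/, 5)
                    if type == 'blob'
                        blob = packed_blobs[sha]
                        # Loose (not yet packed) objects are absent from the pack index;
                        # skip them here, or run `git gc` first so they get counted
                        next if blob.nil?
                        branch.blobs.add(blob)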
                        if not blob.is_shared
                            if blob.branch != nil and blob.branch != branch
                                # this blob has been used in another branch, let's set it to "shared"
                                blob.is_shared = true
                                blob.branch = nil
                            else
                                blob.branch = branch
                            end
                        end
                    end
                end
            end
        end
    end
end

# Now iterate on each branch to compute the space usage for each
branches.each_value do |branch|
    branch.blobs.each do |blob|
        if blob.is_shared
            branch.shared_size += blob.size
            branch.shared_packed_size += blob.packed_size
        else
            if dependable_blob_shas.include?(blob.sha)
                branch.non_shared_dependable_size += blob.size
                branch.non_shared_dependable_packed_size += blob.packed_size
            else
                branch.non_shared_size += blob.size
                branch.non_shared_packed_size += blob.packed_size
            end
        end
    end
    # Now print it if wanted
    if display_branches.empty? or display_branches.include?(branch.name)
        puts "branch: %s" % branch.name
        puts "\tnon shared:"
        puts "\t\tpacked: %s" % branch.non_shared_packed_size
        puts "\t\tnon packed: %s" % branch.non_shared_size
        puts "\tnon shared but with dependencies on it:"
        puts "\t\tpacked: %s" % branch.non_shared_dependable_packed_size
        puts "\t\tnon packed: %s" % branch.non_shared_dependable_size
        puts "\tshared:"
        puts "\t\tpacked: %s" % branch.shared_packed_size
        puts "\t\tnon packed: %s" % branch.shared_size, ""
    end
end

With it I could see that in my 2 MB git repository, one useless branch accounted for 1 MB of blobs not shared with any other branch.
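For readers who want to understand where the numbers come from, here is a minimal sketch of how one line of `git verify-pack -v` output can be parsed. It is not part of the script above; the names `PackEntry` and `parse_verify_pack_line` are made up for illustration. Object lines carry the object id, type, uncompressed size, size in the packfile and offset, plus a delta depth and base object id when the object is stored as a delta. Read that way, "non packed" in the script's output is the uncompressed size, "packed" is the size inside the packfile, and "non shared but with dependencies on it" appears to cover blobs unique to the branch that other objects use as a delta base.

```ruby
# One object line of `git verify-pack -v`, split into named fields.
PackEntry = Struct.new(:sha, :type, :size, :packed_size, :offset, :depth, :base_sha)

OBJECT_TYPES = %w[commit tree blob tag].freeze

# Returns a PackEntry for an object line, or nil for the summary lines
# ("non delta: ...", "chain length = ...", "...pack: ok") that end the listing.
def parse_verify_pack_line(line)
  fields = line.split
  # Object lines have 5 fields (stored whole) or 7 fields (stored as a delta).
  return nil unless [5, 7].include?(fields.length) && OBJECT_TYPES.include?(fields[1])

  sha, type, size, packed_size, offset, depth, base_sha = fields
  PackEntry.new(sha, type, size.to_i, packed_size.to_i, offset.to_i,
                depth && depth.to_i, base_sha)
end
```

An entry whose `base_sha` is nil is stored whole; one with a `base_sha` is only a delta against that base, which is why its packed size can be far smaller than its uncompressed size.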

Gers answered 6/12, 2012 at 12:19 Comment(4)
I tried this script and got the following error: serv01.ams38.siteground.eu [~/www/cledu (dev)] ruby ../../diskspace.rb ../../diskspace.rb:75: undefined method is_shared' for nil:NilClass (NoMethodError)Bruni
This is amazing, just what I was looking for! I'm getting some warnings from it, though: git-branch-space.rb:70: warning: assigned but unused variable - bits, git-branch-space.rb:70: warning: assigned but unused variable - size, git-branch-space.rb:70: warning: assigned but unused variable - path It would be great if there was a little more documentation explaining the output.Caskey
@ChrisKepinski I found that error went away if I ran git gc before the script.Caskey
I guess this script needs to quote the strings it passes to the command line. I am getting errors like: sh: -c: line 0: syntax error near unexpected token '(' sh: -c: line 0: 'git rev-list (HEAD detached at e12f391d2)'Shaffer
U
11

This doesn’t have a proper answer. If you look at the commits contained only in a specific branch, you would get a list of blobs (basically file versions). Now you would have to check whether these blobs are part of any of the commits in the other branches. After doing that you will have a list of blobs that are only part of your branch.

Now you could sum up the sizes of these blobs to get a result, but that would probably be very wrong. Git compresses these blobs against each other, so the actual on-disk size of a blob depends on what other blobs are in your repo. You could remove 1000 blobs of 10 MB each and free only 1 kB of disk space.
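The naive computation described above can be sketched on toy data as follows. Everything here is hypothetical: the blob ids, the sizes and the helper name `exclusive_blob_bytes` are made up, and the result is a sum of uncompressed sizes, so it is subject to exactly the caveat about compression given above.

```ruby
require 'set'

# Toy data: blob id => uncompressed size in bytes (hypothetical values).
BLOB_SIZES = { "a" => 100, "b" => 250, "c" => 4000, "d" => 10 }.freeze

# Blobs reachable from each branch (hypothetical).
BRANCH_BLOBS = {
  "main"    => Set["a", "b"],
  "feature" => Set["a", "c"],
  "old"     => Set["b", "d"]
}.freeze

# Sum the uncompressed sizes of the blobs that only `branch` references.
def exclusive_blob_bytes(branch, branch_blobs, blob_sizes)
  # Union of the blobs referenced by every other branch.
  others = branch_blobs.reject { |name, _| name == branch }
                       .values.reduce(Set.new, :|)
  (branch_blobs.fetch(branch) - others).sum { |id| blob_sizes.fetch(id) }
end
```

With this data, `main` owns nothing exclusively, while `feature` and `old` each own one blob. The figure is an upper bound on what deleting the branch could free, not a prediction.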

Usually a big repo size is caused by single big files in the repo (if not, you are probably doing something wrong :). Info on how to find those can be found here: Find files in git repo over x megabytes, that don't exist in HEAD

Unheardof answered 5/12, 2012 at 11:52 Comment(0)
S
3

Git maintains a directed acyclic graph of commits, with (in a simplistic sense) each commit using up disk space.

Unless all of your branches diverge from the very first commit, there will be commits that are common to several branches, which means that each branch 'shares' some amount of disk space.

This makes it difficult to provide a 'per branch' figure of disk usage, as it would need to be qualified with what amount is shared, and with which other branches it is shared.
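To make the sharing concrete, here is a small sketch that walks a made-up commit graph and splits the commits of a branch into those also reachable from other branches and those reachable from it alone. The graph, the branch names and the helper names are all hypothetical; a real repository would supply the parent links.

```ruby
require 'set'

# Hypothetical history: commit => parents.
#   A - B - C        (main points at C)
#        `- D - E    (topic points at E)
PARENTS = { "A" => [], "B" => ["A"], "C" => ["B"], "D" => ["B"], "E" => ["D"] }.freeze
TIPS = { "main" => "C", "topic" => "E" }.freeze

# Every commit reachable from `tip` by following parent links.
def reachable(tip, parents)
  seen = Set.new
  stack = [tip]
  until stack.empty?
    commit = stack.pop
    next unless seen.add?(commit) # add? returns nil when already visited
    stack.concat(parents.fetch(commit))
  end
  seen
end

# Split the commits of `branch` into shared and exclusive ones.
def shared_and_exclusive(branch, tips, parents)
  mine = reachable(tips.fetch(branch), parents)
  others = tips.reject { |name, _| name == branch }
               .values.map { |tip| reachable(tip, parents) }
               .reduce(Set.new, :|)
  { shared: mine & others, exclusive: mine - others }
end
```

Here `topic` shares A and B with `main` and owns D and E exclusively, which is the qualification a per-branch figure would have to carry.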

Schmit answered 5/12, 2012 at 11:34 Comment(2)
Actually, I want something which would tell me how much space is used only by this branch... so which would sum up the disk space used by the commits which are only in that specific branch... and which will list that for each branch (in order to help me to cut off the right branches of my git repo to make it smaller)Gers
Ok, that makes more sense - but I don't know if that makes it any easier to work out, due to the compression schemes that are used by git. If you're looking to cull oversized branches, it might be a better bet to look for those that have either a lot of non-shared commits, and/or those that have commits with significantly divergent changes to the last common commit. Even then, I'd say it's pretty speculative.Schmit
E
3

Most of the space of your repository is taken by the blobs containing the files.

But when a blob is shared by two branches (or by two files with the same content), it is stored only once. The size of the repository can't be thought of as the sum of the sizes of its branches; there is no such concept as the space taken by a branch.

Git also compresses objects against each other, so a small modification to a file costs very little extra space.

Usually, deleting a branch frees only a small and unpredictable amount of space.
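The reason identical content is never duplicated is that a blob is named by a hash of its content. A minimal sketch, assuming a repository that uses the default SHA-1 object format (the helper name `git_blob_id` is made up for illustration):

```ruby
require 'digest/sha1'

# Object id git assigns to a blob: SHA-1 over a "blob <size>\0" header
# followed by the raw file content.
def git_blob_id(content)
  bytes = content.b # treat the content as raw bytes
  Digest::SHA1.hexdigest("blob #{bytes.bytesize}\0".b + bytes)
end
```

Two files with the same bytes get the same id, whatever their path or branch, so the object store holds a single copy. The result should match what `git hash-object` prints for the same content.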

Entropy answered 5/12, 2012 at 11:38 Comment(0)
T
3

I had the same problem this morning and wrote a quick script. Please remember to run it in another cloned directory, as it will wipe out everything that is not committed: git clean -d -x -f deletes every untracked and ignored file in the working tree.

for a in $(git branch -a | grep remotes | awk '{print $1}' | sed 's/remotes\/origin\///'); do echo -n "${a} - "; git clean -d -x -f > /dev/null 2>&1; git checkout "${a}" > /dev/null 2>&1; du -hs --exclude=.git .; done

This cleans the working tree, checks out each remote branch in turn, and prints the size of the checked-out files without the .git directory. The --exclude option belongs to GNU du; BSD and macOS du use -I .git instead.

With this, I was able to find the person who pushed a branch with a big file in it.

Tensile answered 8/12, 2017 at 16:17 Comment(3)
it worked for me, except I need to change the command du -hs -I .git . into du -hs --exclude=.git .Groff
You are right @HilmanNihri, answer has been edited. Thank you!Tensile
please write Please remember to do this in another cloned directory as it will wipe out everything that is not committed in the front. I just lost my work...Fellowman
M
2

Since Git 2.31, git rev-list supports --disk-usage:

# reachable objects
git rev-list --disk-usage --objects --all

https://git-scm.com/docs/git-rev-list#_examples
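To get closer to a per-branch figure, the same option can be restricted to the objects that one branch reaches and the others do not, using `--not`. The following is a sketch under that assumption; the helper names are made up, and the git commands are only run when the last two functions are called inside a repository with Git 2.31 or later.

```ruby
# Argument list for `git rev-list --disk-usage` covering the objects reachable
# from `branch` but from none of `other_branches`.
def exclusive_disk_usage_args(branch, other_branches)
  ["git", "rev-list", "--disk-usage", "--objects", branch, "--not", *other_branches]
end

# Names of the local branches (must be called inside a repository).
def local_branches
  IO.popen(["git", "for-each-ref", "--format=%(refname:short)", "refs/heads"], &:read)
    .split("\n")
end

# On-disk bytes of the objects that only `branch` reaches.
def exclusive_disk_usage(branch, all_branches)
  IO.popen(exclusive_disk_usage_args(branch, all_branches - [branch]), &:read).to_i
end
```

Calling `exclusive_disk_usage(name, local_branches)` for each name gives one number per branch. It reports the current on-disk size of those objects, so the space actually freed after deleting the branch and repacking may differ, because objects that were stored as deltas against them have to be stored differently.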

Moralez answered 16/3, 2021 at 5:35 Comment(2)
As far as I can see, this isn't really capturing what we want. This is showing how much space the latest files in the branch would use if you checked it out. I think we want to know (roughly) if the branch didn't exist (and everything was garbage collected) how much smaller would the repo be.Caskey
just checked, does not print anything on git 2.25.1Kwiatkowski
