Git: show total file size difference between two commits?

8

90

Is it possible to show the total file size difference between two commits? Something like:

$ git file-size-diff 7f3219 bad418 # I wish this worked :)
-1234 bytes

I’ve tried:

$ git diff --patch-with-stat

And that shows the file size difference for each binary file in the diff — but not for text files, and not the total file size difference.

Any ideas?

Sirois answered 1/6, 2012 at 5:48 Comment(1)
Here is a 3-line bash script that gives you the size of a given commit: https://mcmap.net/q/13617/-git-find-fat-commit Soutane
B
110

git cat-file -s will output the size in bytes of an object in git. git diff-tree can tell you the differences between one tree and another.
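As a rough illustration of these two building blocks, the sketch below sets up a throwaway repository (the file name a.txt, its contents, and the demo identity are all made up), prints the raw git diff-tree record for the change, and subtracts the two blob sizes reported by git cat-file -s. It assumes git and mktemp are available.

```shell
#!/bin/bash
# Build a throwaway repository with two commits (paths and contents are invented for the demo).
demo_repo=$(mktemp -d)
cd "$demo_repo" || exit 1
git init -q .
printf 'hello\n' > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m first
printf 'hello, world\n' > a.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -am second

# Raw record format: ":<old mode> <new mode> <old blob> <new blob> <status><TAB><path>"
git diff-tree -r HEAD~1 HEAD

# Blob sizes in bytes for the same path at each commit.
old_size=$(git cat-file -s HEAD~1:a.txt)   # 6 bytes
new_size=$(git cat-file -s HEAD:a.txt)     # 13 bytes
echo "a.txt: $(( new_size - old_size )) bytes"   # prints "a.txt: 7 bytes"
```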

Putting this together into a script called git-file-size-diff located somewhere on your PATH will give you the ability to call git file-size-diff <tree-ish> <tree-ish>. We can try something like the following:

#!/bin/bash
USAGE='[--cached] [<rev-list-options>...]

Show file size changes between two commits or the index and a commit.'

SUBDIRECTORY_OK=1
. "$(git --exec-path)/git-sh-setup"
args=$(git rev-parse --sq "$@")
[ -n "$args" ] || usage
cmd="diff-tree -r"
[[ $args =~ "--cached" ]] && cmd="diff-index"
eval "git $cmd $args" | {
  total=0
  while read A B C D M P
  do
    case $M in
      M) bytes=$(( $(git cat-file -s $D) - $(git cat-file -s $C) )) ;;
      A) bytes=$(git cat-file -s $D) ;;
      D) bytes=-$(git cat-file -s $C) ;;
      *)
        echo >&2 warning: unhandled mode $M in \"$A $B $C $D $M $P\"
        continue
        ;;
    esac
    total=$(( $total + $bytes ))
    printf '%d\t%s\n' $bytes "$P"
  done
  echo total $total
}

In use this looks like the following:

$ git file-size-diff HEAD~850..HEAD~845
-234   Documentation/RelNotes/1.7.7.txt
112    Documentation/git.txt
-4     GIT-VERSION-GEN
43     builtin/grep.c
42     diff-lib.c
594    git-rebase--interactive.sh
381    t/t3404-rebase-interactive.sh
114    t/test-lib.sh
743    tree-walk.c
28     tree-walk.h
67     unpack-trees.c
28     unpack-trees.h
total 1914

By using git-rev-parse it should accept all the usual ways of specifying commit ranges.

EDIT: updated to record the cumulative total. Note that bash runs each part of a pipeline in a subshell, so the while read loop and the final echo are grouped in curly braces; otherwise the total would be lost when the loop's subshell exits.
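The subshell effect is easy to reproduce without git. In this minimal sketch (the function names lost_total and kept_total are made up, and bash is assumed to run with its default settings, i.e. without lastpipe), the first function updates total inside the pipeline and prints it from the parent shell, while the second keeps the loop and the echo in the same brace group:

```shell
#!/bin/bash
# The while loop runs in the pipeline's subshell, so the parent never sees the sum.
lost_total() {
  local total=0
  printf '%s\n' 10 -4 7 | while read -r n; do total=$(( total + n )); done
  echo "$total"
}

# The braces put the loop and the echo in the same subshell, so the sum survives.
kept_total() {
  printf '%s\n' 10 -4 7 | {
    total=0
    while read -r n; do total=$(( total + n )); done
    echo "$total"
  }
}

lost_total   # prints 0
kept_total   # prints 13
```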

EDIT: added support for comparing the index against another tree-ish by using a --cached argument to call git diff-index instead of git diff-tree. eg:

$ git file-size-diff --cached master
-570   Makefile
-134   git-gui.sh
-1     lib/browser.tcl
931    lib/commit.tcl
18     lib/index.tcl
total 244

EDIT: Mark script as capable of running in a subdirectory of a git repository.

Believe answered 1/6, 2012 at 8:54 Comment(14)
+1 Thanks! This would be absolutely perfect if it would print out the total size difference at the bottom. I want to see how many bytes were added/removed project-wide between two refs (not just per file, but in total, too).Sirois
Another question: why are you sourcing git-sh-setup here? You don’t seem to be using any of the functions it defines. Just wondering!Sirois
It does basic checks like producing a sensible message if you run this command in a directory that is not a git repository. It also can help abstract out some platform differences. Mostly habit though. When writing a git script - first bring in the git-sh-setup file.Believe
Thanks for the awesome script! I was looking for some way to monitor the increase in size after each commit, and this helps a lot. I made a small gist that shows only the total increase between all (or some of) the commits in the repository: gist.github.com/iamaziz/1019e5a9261132ac2a9a Thanks again!Ridgley
The use case I'm looking for is to preview large commits before I make them. Is there a way I can find the size changes of the currently staged changes? I've read through the tree-ish documentation, and I could not find a way to reference "current staged changes".Loom
Added support for comparing against the index using git-diff-index.Believe
you can run echo $PATH to see your path directories to see where you can put this script file. I put mine in /usr/local/git/bin and it worked great. You can also add a path to your $PATH if you want to put the script somewhere else.Transubstantiation
How do I use this? What is HEAD~850? Can I just use instead the commit id?Agitato
@Agitato HEAD~850 is 850 commits before HEAD. It is just another notation for a commit and yes you can use a specific commit id or a tag or anything that can be resolved to a commit. The script uses git rev-parse so see the manual section "Specifying Revisions" in the git-rev-parse documentation for the full details. (git-scm.com/docs/git-rev-parse)Believe
How would i be able to see what size the files had before? I am currently preparing a pull request that optimizes file output structure and would like to calculate a percentage of size decrease.Roundlet
That looks great! Any way to make it work on Windows?Daube
This works on windows. Create the file in a directory that is on your PATH.Believe
Great script! Very useful. I've added it to my git-filesize-diff.sh file in my eRCaGuy_dotfiles repo. You can see the output from my modified version of your script in my commit message here.Isar
Nice script. However, the $(git cat-file -s $D) - $(git cat-file -s $C) construct is problematic in the sense that it shows only the file size delta, not the file data delta. For example, you could have a file of 1024 bytes that has its content replaced by a different 1024 bytes; the $(git cat-file -s $D) - $(git cat-file -s $C) construct would then calculate a delta of 0 bytes, while the actual data delta is 1024 bytes.Stuckup
D
30

You can pipe the output of git show into wc -c for each version of the file:

git show some-ref:some-path-to-file | wc -c
git show some-other-ref:some-path-to-file | wc -c

and compare the two numbers.
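To do the comparison in one step, the two sizes can be subtracted in a small helper. The sketch below is only one possible shape: the function name file_size_delta and its argument order are made up, and it uses git cat-file -s (which reports the blob size directly) instead of git show | wc -c.

```shell
#!/bin/bash
# Hypothetical helper: print the byte-size change of a file between two revisions.
# Usage: file_size_delta <old-ref> <new-ref> <path>
file_size_delta() {
  local old new
  old=$(git cat-file -s "$1:$3") || return 1   # size of the file at the old revision
  new=$(git cat-file -s "$2:$3") || return 1   # size of the file at the new revision
  echo $(( new - old ))
}
```

For example, file_size_delta HEAD~1 HEAD some-path-to-file prints a positive number if the file grew in the last commit and a negative one if it shrank.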

Discophile answered 1/6, 2012 at 6:18 Comment(4)
+1 This is great for quickly checking the size difference of a file between versions. But how can this be used to get the total file difference between two commits? I want to see how many bytes were added/removed project-wide between two refs.Sirois
You can skip the | wc -c if you use cat-file -s instead of showRyannryazan
Using the improvement suggested by @neu242, I wrote this bash function: gdbytes () { echo "$(git cat-file -s $1:$3) -> $(git cat-file -s $2:$3)" } Which makes it easy to see how file size changed since last commit with e.g., gdbytes @~ @ index.htmlThousandfold
if the some-ref: part is skipped, do you obtain the file size in the working directory?Whirly
G
4

Expanding on matthiaskrgr's answer, https://github.com/matthiaskrgr/gitdiffbinstat can be used like the other scripts:

gitdiffbinstat.sh HEAD..HEAD~4

Imo it really works well, much faster than anything else posted here. Sample output:

$ gitdiffbinstat.sh HEAD~6..HEAD~7
 HEAD~6..HEAD~7
 704a8b56161d8c69bfaf0c3e6be27a68f27453a6..40a8563d082143d81e622c675de1ea46db706f22
 Recursively getting stat for path "./c/data/gitrepo" from repo root......
 105 files changed in total
  3 text files changed, 16 insertions(+), 16 deletions(-) => [±0 lines]
  102 binary files changed 40374331 b (38 Mb) -> 39000258 b (37 Mb) => [-1374073 b (-1 Mb)]
   0 binary files added, 3 binary files removed, 99 binary files modified => [-3 files]
    0 b  added in new files, 777588 b (759 kb) removed => [-777588 b (-759 kb)]
    file modifications: 39596743 b (37 Mb) -> 39000258 b (37 Mb) => [-596485 b (-582 kb)]
    / ==>  [-1374073 b (-1 Mb)]

The output directory looks funky with ./c/data... because /c is actually the filesystem root.

Gluteal answered 12/4, 2016 at 19:17 Comment(3)
You didn't need to comment on Matthias' post - you could have suggested an edit to it instead, with these details that he didn't provide. By current standards, his answer would be considered a "link-only answer", and be deleted, so these sorts of details are important.Helbonia
who can take my answer and include it into matthias?Gluteal
If you want, you can make a suggested edit yourself. (In my experience, it would tend to get rejected by reviewers, but a clear explanation in the Edit Summary could help.) But maybe I wasn't clear in my comment to you... your answer is a stand-alone answer, a good update of Matthias' older answer. You didn't need to include the text that explained that you meant to comment, is all. I edited the answer in a way that gives appropriate credit to Matthias. You don't need to do more.Helbonia
I
3

I made a bash script to compare branches/commits etc by actual file/content size. It can be found at https://github.com/matthiaskrgr/gitdiffbinstat and also detects file renames.

Incult answered 29/12, 2012 at 1:41 Comment(1)
Got an example usage of this?Unicycle
J
2

A comment on the git-file-size-diff script suggested by patthoyts. The script is very useful, but I have found two issues:

  1. When a file's type changes (for example, a regular file is replaced by a symlink), git reports the status T, which the case statement does not handle:

    T) echo >&2 "Skipping change of type"
    continue ;;
    
  2. If a SHA-1 value doesn't exist anymore (for some reason), the script fails. You need to validate the SHA before getting the file size:

    git cat-file -e "$D" || continue

The complete case statement will then look like this:

    case $M in
      M) git cat-file -e "$D" || continue
         git cat-file -e "$C" || continue
         bytes=$(( $(git cat-file -s "$D") - $(git cat-file -s "$C") )) ;;
      A) git cat-file -e "$D" || continue
         bytes=$(git cat-file -s "$D") ;;
      D) git cat-file -e "$C" || continue
         bytes=-$(git cat-file -s "$C") ;;
      T) echo >&2 "Skipping change of type"
         continue ;;
      *)
        echo >&2 "warning: unhandled mode $M in \"$A $B $C $D $M $P\""
        continue
        ;;
    esac
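
The repeated existence checks could also be factored into a small helper. This is only a sketch: the name blob_size is made up, and it assumes that skipping a missing object silently is acceptable.

```shell
#!/bin/bash
# Hypothetical helper: print the size of object $1 in bytes,
# or return non-zero without output if the object does not exist.
blob_size() {
  git cat-file -e "$1" 2>/dev/null || return 1
  git cat-file -s "$1"
}

# Inside the loop, a branch could then read, for example:
#   A) bytes=$(blob_size "$D") || continue ;;
```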
Jopa answered 22/6, 2017 at 9:40 Comment(0)
C
1

The Git core commands can make this much more efficient. Instead of running three commands per blob, the postprocessing takes three commands in total:

filesizediffs() {
    git diff-tree "$@" \
    | awk '$1":"$2 ~ /:[10]0....:[10]0/ {
            print $3?$3:empty,substr($5,3)
            print $4?$4:empty,substr($5,3)
      }'  FS='[  ]' empty=`git hash-object -w --stdin <&-` \
    | git cat-file --batch-check=$'%(objectsize)\t%(rest)' \
    |  awk '!seen[$2]++ { first[$2]=$1 }
            $1!=first[$2] { print $1-first[$2],$2; total+=$1-first[$2] }
            END { print "total size difference "total }' FS=$'\t' OFS=$'\t'
}
filesizediffs @

on GNU/anything.

Crosscurrent answered 28/8, 2023 at 17:45 Comment(0)
S
0

If you’re happy with an approximate answer, you can get a back-of-napkin size of the data in a commit with:

git archive <COMMIT> | wc -c

The reported size will be the number of bytes of all the data in the commit plus some tar metadata. Since tar by itself (the default for git archive) doesn’t do compression the reported numbers are somewhat comparable.

If your intent is to find the one commit that added the 1GB log file, this approach is perfectly sufficient.
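
To turn this into a difference between two commits, the two archive sizes can be subtracted. The helper below is a sketch with a made-up name (archive_size_delta). As far as I know, tar output is padded to block boundaries, so very small changes may not register at all.

```shell
#!/bin/bash
# Hypothetical helper: approximate size change between two commits,
# measured as the difference in uncompressed tar archive sizes.
# Usage: archive_size_delta <old-commit> <new-commit>
archive_size_delta() {
  local old new
  old=$(git archive "$1" | wc -c)
  new=$(git archive "$2" | wc -c)
  echo $(( new - old ))
}
```

For example, archive_size_delta HEAD~1 HEAD gives a rough byte count for what the last commit added or removed.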

Spurgeon answered 28/8, 2023 at 16:29 Comment(0)
C
0

Usually, diff size (when dealing with source code) is meant as the number of added and removed lines. That is, size is judged according to the cognitive load imposed on a hypothetical reviewer of the commit. This takes into account that deletions are usually easier to review than additions.

git diff --shortstat

$ git diff --shortstat HEAD~5 HEAD
9 files changed, 117 insertions(+), 26 deletions(-)

git diff --stat

$ git diff --stat HEAD~5 HEAD
 .../java/org/apache/calcite/rex/RexSimplify.java   | 50 +++++++++++++++++-----
 .../apache/calcite/sql/fun/SqlTrimFunction.java    |  2 +-
 .../apache/calcite/sql2rel/SqlToRelConverter.java  | 16 +++++++
 .../org/apache/calcite/util/SaffronProperties.java | 19 ++++----
 .../org/apache/calcite/test/RexProgramTest.java    | 24 +++++++++++
 .../apache/calcite/test/SqlToRelConverterTest.java |  8 ++++
 .../apache/calcite/test/SqlToRelConverterTest.xml  | 15 +++++++
 pom.xml                                            |  2 +-
 .../apache/calcite/adapter/spark/SparkRules.java   |  7 +--
 9 files changed, 117 insertions(+), 26 deletions(-)

git diff --numstat

$ git diff --numstat HEAD~5 HEAD
40      10      core/src/main/java/org/apache/calcite/rex/RexSimplify.java
1       1       core/src/main/java/org/apache/calcite/sql/fun/SqlTrimFunction.java
16      0       core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java
8       11      core/src/main/java/org/apache/calcite/util/SaffronProperties.java
24      0       core/src/test/java/org/apache/calcite/test/RexProgramTest.java
8       0       core/src/test/java/org/apache/calcite/test/SqlToRelConverterTest.java
15      0       core/src/test/resources/org/apache/calcite/test/SqlToRelConverterTest.xml
1       1       pom.xml
4       3       spark/src/main/java/org/apache/calcite/adapter/spark/SparkRules.java
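
If a single total is wanted, the --numstat columns can be summed with awk. This is a minimal sketch; the function name sum_numstat is made up, and binary files (which numstat reports as "-") are simply skipped.

```shell
#!/bin/bash
# Hypothetical helper: read `git diff --numstat` output on stdin and print line totals.
# Fields are tab-separated: <added> <deleted> <path>; binary files show "-" in both counts.
sum_numstat() {
  awk -F '\t' '
    $1 != "-" { added += $1; deleted += $2 }
    END { printf "%d insertions, %d deletions, net %d lines\n", added, deleted, added - deleted }
  '
}

# Usage: git diff --numstat HEAD~5 HEAD | sum_numstat
```

For the sample output above, this would report 117 insertions, 26 deletions, net 91 lines, which agrees with the --shortstat summary.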

Source of the usage examples: https://mcmap.net/q/14053/-how-to-list-only-the-names-of-files-that-changed-between-two-commits

Calaboose answered 7/1 at 10:25 Comment(0)
