Limiting file size in git repository
I'm currently thinking of changing my VCS from Subversion to git. Is it possible to limit the file size within a commit in a git repository? In Subversion, for example, there is a hook for this: http://www.davidgrant.ca/limit_size_of_subversion_commits_with_this_hook

In my experience, people, especially inexperienced ones, sometimes commit files which should not go into a VCS (e.g. big filesystem images).

Dehumanize answered 22/8, 2011 at 12:48 Comment(1)
the best answer for it is over at https://mcmap.net/q/13148/-how-to-limit-file-size-on-commit – Glaciology

As I was struggling with this for a while, even with the description, and I think it is relevant for others too, I thought I'd post an implementation of what J-16 SDiZ described.

So, my take on a server-side update hook that prevents files that are too big from being pushed:

#!/bin/bash

# Script to limit the size of a push to git repository.
# Git repo has issues with big pushes, and we shouldn't have a real need for those
#
# eis/02.02.2012

# --- Safety check, should not be run from command line
if [ -z "$GIT_DIR" ]; then
        echo "Don't run this script from the command line." >&2
        echo " (if you want, you could supply GIT_DIR then run" >&2
        echo "  $0 <ref> <oldrev> <newrev>)" >&2
        exit 1
fi

# Test that tab replacement works, issue in some Solaris envs at least
testvariable=`echo -e "\t" | sed 's/\s//'`
if [ "$testvariable" != "" ]; then
        echo "Environment check failed - please contact git hosting." >&2
        exit 1
fi


# File size limit is meant to be configured through 'hooks.filesizelimit' setting
filesizelimit=$(git config hooks.filesizelimit)

# If we haven't configured a file size limit, use default value of about 100M
if [ -z "$filesizelimit" ]; then
        filesizelimit=100000000
fi

# In an update hook, $1 is the ref name, $2 the old revision and $3 the new revision
newrev=$3

# With this command, we can find information about the incoming file with the biggest size
# We also normalize the line for excess whitespace
biggest_checkin_normalized=$(git ls-tree --full-tree -r -l "$newrev" | sort -k 4 -n -r | head -1 | sed 's/^ *//;s/ *$//;s/\s\{1,\}/ /g' )

# Based on that, we can find what we are interested in
filesize=$(echo "$biggest_checkin_normalized" | cut -d ' ' -f4)

# Actual comparison
# To cancel a push, we exit with status code 1
# It is also a good idea to print out some info about the cause of rejection
if [ "$filesize" -gt "$filesizelimit" ]; then

        # To be more user-friendly, we also look up the name of the offending file
        filename=$(echo "$biggest_checkin_normalized" | cut -d ' ' -f5)

        echo "Error: Too large push attempted." >&2
        echo  >&2
        echo "File size limit is $filesizelimit, and you tried to push file named $filename of size $filesize." >&2
        echo "Contact configuration team if you really need to do this." >&2
        exit 1
fi

exit 0

Note that, as pointed out in the comments, this code only checks the latest commit; it would need to be tweaked to iterate over the commits between $2 and $3 and run the check on all of them.
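
For reference, that iteration could look something like this (a sketch, not the original hook: the helper name `check_range` is mine, and the update hook's arguments are `$1`=refname, `$2`=oldrev, `$3`=newrev):

```shell
#!/bin/bash
# Sketch: walk every commit in the pushed range and reject the push if any
# commit's tree contains a file above the limit.
# Usage: check_range <oldrev> <newrev> <limit-in-bytes>; returns 0 if OK.
check_range() {
  local oldrev=$1 newrev=$2 limit=$3 range commit biggest
  # A branch-creation push has an all-zero oldrev; check newrev alone then.
  case $oldrev in
    *[!0]*) range=$oldrev..$newrev ;;
    *)      range=$newrev ;;
  esac
  for commit in $(git rev-list "$range"); do
    # Column 4 of `ls-tree -l` is the blob size; filter out "-" (non-blobs).
    biggest=$(git ls-tree --full-tree -r -l "$commit" |
      awk '$4 ~ /^[0-9]+$/ {print $4}' | sort -n | tail -1)
    if [ -n "$biggest" ] && [ "$biggest" -gt "$limit" ]; then
      echo "commit $commit has a file of $biggest bytes (limit $limit)" >&2
      return 1
    fi
  done
  return 0
}

# In an update hook one would call: check_range "$2" "$3" "$filesizelimit" || exit 1
```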

Predisposition answered 3/2, 2012 at 9:56 Comment(2)
How do I use it? Execute this file every time before committing? – Menology
Yes. But I don't know how to configure it in git. – Menology

The answers by eis and J-16 SDiZ suffer from a severe problem. They are only checking the state of the final commit, $3 or $newrev. They need to also check what is being submitted in the other commits between $2 (or $oldrev) and $3 (or $newrev) in the update hook.

J-16 SDiZ is closer to the right answer.

The big flaw is that someone whose departmental server has this update hook installed to protect it will find out the hard way that:

After using git rm to remove a big file that was accidentally checked in, the current tree or latest commit will look fine, yet the push will still carry the entire chain of commits, including the one with the big file that was later deleted, creating a swollen, unhappy, fat history that nobody wants.

The solution is either to check each and every commit from $oldrev to $newrev, or to specify the entire range $oldrev..$newrev. Be darn sure you are not checking $newrev alone, or this will fail, leaving massive junk in your git history, pushed out and shared with others, and difficult or impossible to remove after that.

But answered 8/4, 2015 at 22:26 Comment(0)

This one is pretty good:

#!/bin/bash -u
#
# git-max-filesize
#
# git pre-receive hook to reject large files that should be committed
# via git-lfs (large file support) instead.
#
# Author: Christoph Hack <[email protected]>
# Copyright (c) 2017 mgIT GmbH. All rights reserved.
# Distributed under the Apache License. See LICENSE for details.
#
set -o pipefail

readonly DEFAULT_MAXSIZE="5242880" # 5MB
readonly CONFIG_NAME="hooks.maxfilesize"
readonly NULLSHA="0000000000000000000000000000000000000000"
readonly EXIT_SUCCESS="0"
readonly EXIT_FAILURE="1"

# main entry point
function main() {
  local status="$EXIT_SUCCESS"

  # get maximum filesize (from repository-specific config)
  local maxsize
  maxsize="$(get_maxsize)"
  if [[ "$?" != 0 ]]; then
    echo "failed to get ${CONFIG_NAME} from config"
    exit "$EXIT_FAILURE"
  fi

  # skip this hook entirely if maxsize is 0.
  if [[ "$maxsize" == 0 ]]; then
    cat > /dev/null
    exit "$EXIT_SUCCESS"
  fi

  # read lines from stdin (format: "<oldref> <newref> <refname>\n")
  local oldref
  local newref
  local refname
  while read oldref newref refname; do
    # skip branch deletions
    if [[ "$newref" == "$NULLSHA" ]]; then
      continue
    fi

    # find large objects
    # check all objects from $oldref (possible $NULLSHA) to $newref, but
    # skip all objects that have already been accepted (i.e. are referenced by
    # another branch or tag).
    local target
    if [[ "$oldref" == "$NULLSHA" ]]; then
      target="$newref"
    else
      target="${oldref}..${newref}"
    fi
    local large_files
    large_files="$(git rev-list --objects "$target" --not --branches=\* --tags=\* | \
      git cat-file $'--batch-check=%(objectname)\t%(objecttype)\t%(objectsize)\t%(rest)' | \
      awk -F '\t' -v maxbytes="$maxsize" '$3 > maxbytes' | cut -f 4-)"
    if [[ "$?" != 0 ]]; then
      echo "failed to check for large files in ref ${refname}"
      continue
    fi

    IFS=$'\n'
    for file in $large_files; do
      if [[ "$status" == 0 ]]; then
        echo ""
        echo "-------------------------------------------------------------------------"
        echo "Your push was rejected because it contains files larger than $(numfmt --to=iec "$maxsize")."
        echo "Please use https://git-lfs.github.com/ to store larger files."
        echo "-------------------------------------------------------------------------"
        echo ""
        echo "Offending files:"
        status="$EXIT_FAILURE"
      fi
      echo " - ${file} (ref: ${refname})"
    done
    unset IFS
  done

  exit "$status"
}

# get the maximum filesize configured for this repository or the default
# value if no specific option has been set. Suffixes like 5k, 5m, 5g, etc.
# can be used (see git config --int).
function get_maxsize() {
  local value;
  value="$(git config --int "$CONFIG_NAME")"
  if [[ "$?" != 0 ]] || [[ -z "$value" ]]; then
    echo "$DEFAULT_MAXSIZE"
    return "$EXIT_SUCCESS"
  fi
  echo "$value"
  return "$EXIT_SUCCESS"
}

main

You can configure the size in the serverside config file by adding:

[hooks]
        maxfilesize = 1048576 # 1 MiB
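
The setting can also be made from the command line in the bare repository on the server; because the hook reads it with `git config --int`, size suffixes like `5m` are accepted. A short demo (in a scratch repository, since the exact server path varies):

```shell
# Demo in a scratch repository; on a real server, run the `git config`
# lines inside the bare repository instead.
repo=$(mktemp -d) && git init -q "$repo" && cd "$repo"

git config hooks.maxfilesize 5m      # --int in the hook expands the suffix
git config --int hooks.maxfilesize   # prints 5242880
git config hooks.maxfilesize 0       # 0 disables the check entirely
```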
Blunder answered 17/9, 2019 at 15:17 Comment(1)
this is excellent and has lots of nice tricks and attention to detail! – Havildar

If you are using gitolite, you can also try a VREF. One VREF, called MAX_NEWBIN_SIZE, is already provided by default (the code is in gitolite/src/VREF/MAX_NEWBIN_SIZE). It works like this:

repo name
RW+     =   username
-   VREF/MAX_NEWBIN_SIZE/1000   =   usernames 

where 1000 is an example threshold in bytes.

This VREF works like an update hook and will reject your push if any file you are pushing is larger than the threshold.

Stipple answered 11/2, 2015 at 13:51 Comment(0)

Yes, git has hooks as well (git hooks), but it rather depends on the actual workflow you will be using.

If you have inexperienced users, it is much safer to pull from them than to let them push. That way, you can make sure they won't screw up the main repository.

Validity answered 22/8, 2011 at 12:53 Comment(0)

I want to highlight another set of approaches that address this issue at the pull-request stage: GitHub Actions and Apps. These don't stop large files from being committed to a branch, but if the files are removed prior to the merge, the resulting base branch will not have them in its history.

There's a recently developed action that checks the added file sizes (through the GitHub API) against a user-defined reference value: lfs-warning.

I've also personally hacked together a Probot app to screen for large file sizes in a PR (against a user-defined value), but it's much less efficient: sizeCheck

Thick answered 16/6, 2020 at 15:11 Comment(0)

Another way is to version a .gitignore file, which will prevent files with certain extensions from showing up in the status.
You can still have hooks as well (downstream or upstream, as the other answers suggest), but at least every downstream repo can include that .gitignore to avoid adding .exe, .dll, .iso, ... files.
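
A minimal example of such a .gitignore (the extensions are illustrative; adjust to whatever large artifacts your team tends to produce):

```gitignore
# Large binary artifacts that should never be committed
*.exe
*.dll
*.iso
*.img
*.vmdk
```

Note that this filters by name only; it does not enforce a size limit, so it complements rather than replaces a hook.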


If you are using hooks, consider Git 2.42 (Q3 2023): some atoms that can be used in "--format=<format>" for "git ls-tree"(man) were not supported by git ls-files(man), even though they were relevant in the context of the latter.

See commit 4d28c4f (23 May 2023) by ZheNing Hu (adlternative).
(Merged by Junio C Hamano -- gitster -- in commit 32fe7ff, 13 Jun 2023)

ls-files: align format atoms with ls-tree

Signed-off-by: ZheNing Hu

"git ls-files --format"(man) can be used to format the output of multiple file entries in the index, while "git ls-tree --format"(man) can be used to format the contents of a tree object.
However, the current set of "%(objecttype)", "%(objectsize)", and "%(objectsize:padded)" atoms supported by "git ls-files --format" is a subset of what is available in "git ls-tree --format"(man).

Users sometimes need to establish a unified view between the index and tree, which can help with comparison or conversion between the two.

Therefore, this patch adds the missing atoms to "git ls-files --format"(man).

  • "%(objecttype)" can be used to retrieve the object type corresponding to a file in the index,
  • "%(objectsize)" can be used to retrieve the object size corresponding to a file in the index, and
  • "%(objectsize:padded)" is the same as "%(objectsize)", except with padded format.

git ls-files now includes in its man page:

objecttype

The object type of the file which is recorded in the index.

git ls-files now includes in its man page:

objectsize[:padded]

The object size of the file which is recorded in the index ("-" if the object is a commit or tree). It also supports a padded format of size with "%(objectsize:padded)".

Cu answered 22/8, 2011 at 12:54 Comment(1)
Note: hooks aren't propagated through clone: stackoverflow.com/questions/5165239/… – Cu

From what I have seen, it is going to be a very rare case that someone checks in, say, a 200 MB or even bigger file.

While you can prevent this from happening with server-side hooks (not so with client-side hooks, since you have to rely on each person having them installed), much like you would in SVN, you also have to take into account that in Git it is much, much easier to remove such a file or commit from the repository. You did not have that luxury in SVN, at least not an easy way.

Stallion answered 22/8, 2011 at 16:34 Comment(2)
Actually, in git isn't it more difficult? A git rm of the file doesn't actually remove it from the repo, it just makes it not appear in later revisions. You still waste the space/bandwidth for it. – Peppermint
@JosephGarvin - How? git rm is the command to remove a file from the current commit. It doesn't change history. You have other commands like git commit --amend and git filter-branch. – Stallion

I am using gitolite, and the update hook was already in use, so instead of the update hook I used the pre-receive hook. The script posted by Chriki worked fabulously, except that the data is passed via stdin, so I made a one-line change:

- refname=$3
+ read a b refname

(there may be a more elegant way to do that but it works)
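
For anyone making the same change: pre-receive (and post-receive) hooks receive one "<oldrev> <newrev> <refname>" line on stdin per pushed ref, whereas the update hook is called once per ref with those values as $1..$3. A sketch of the stdin convention (the function name is mine):

```shell
#!/bin/bash
# pre-receive hooks read "<oldrev> <newrev> <refname>" lines from stdin,
# one per pushed ref; update hooks get the same values as $1..$3 instead.
process_push() {
  while read -r oldrev newrev refname; do
    echo "checking push to $refname ($oldrev -> $newrev)"
    # ... per-ref size checks would go here ...
  done
}

# Demo with a fabricated line; in a real hook, git supplies the input:
echo "oldsha newsha refs/heads/main" | process_push
```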

Fishy answered 29/8, 2014 at 21:41 Comment(0)

You need a solution that caters to the following scenarios.

  1. If someone pushes multiple commits together, the hook should check ALL the commits (between oldref and newref) in that push for files greater than a certain limit.
  2. The hook should run for all users. If you write a client-side hook, it will not be available to all users, since such hooks are not transferred when you do a git push. So what is needed is a server-side hook, such as a pre-receive hook.

This hook (https://github.com/mgit-at/git-max-filesize) deals with the above 2 cases and seems to also correctly handle edge cases such as new branch pushes and branch deletes.

Claudineclaudio answered 17/5, 2020 at 12:41 Comment(0)

You can use a hook: either a pre-commit hook (on the client) or an update hook (on the server). Run git ls-files --cached (for pre-commit) or git ls-tree --full-tree -r -l $3 (for update) and act accordingly.

git ls-tree -l would give something like this:

100644 blob 97293e358a9870ac4ddf1daf44b10e10e8273d57    3301    file1
100644 blob 02937b0e158ff8d3895c6e93ebf0cbc37d81cac1     507    file2

Grab the fourth column; it is the size. Use git ls-tree --full-tree -r -l HEAD | sort -k 4 -n -r | head -1 to get the largest file. Use cut to extract, if [ a -lt b ] to check the size, etc.
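
Putting those pieces together, a client-side pre-commit hook along these lines might look as follows (a sketch: the 10 MiB limit and the helper name are mine, and filenames containing newlines are not handled):

```shell
#!/bin/bash
# Reject a commit if any staged file exceeds the limit.
limit=10485760  # 10 MiB, an example value

check_staged_sizes() {
  local path size
  git diff --cached --name-only --diff-filter=AM |
  while IFS= read -r path; do
    # ":$path" names the staged (index) version of the file
    size=$(git cat-file -s ":$path" 2>/dev/null) || continue
    if [ "$size" -gt "$limit" ]; then
      echo "$path is $size bytes (limit is $limit)" >&2
      exit 1   # exits the pipeline subshell; becomes the function's status
    fi
  done
}

check_staged_sizes || exit 1
```

Install it as .git/hooks/pre-commit; note that hooks are not copied along by git clone, so every user has to install it themselves.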

Sorry, I think if you are a programmer, you should be able to do this yourself.

Notarial answered 22/8, 2011 at 12:54 Comment(1)
@J-16SDiZ Very immature answer. – Siliqua

© 2022 - 2024 — McMap. All rights reserved.