Are there any good workarounds to the GitHub 100MB file size limit for text files?

I have a 190 MB plain text file that I want to track on GitHub.

The text file is a pronunciation lexicon file for our text-to-speech engine. We regularly add and modify lines in it, and the diffs are fairly small, so it's perfect for git in that sense.

However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.

I would like to keep the file as a single file rather than splitting it, because that's how our workflow currently works, and supporting multiple text files as input/output in our tools would require some coding (and we don't have many development resources).

One idea I've had is to set up pre- and post-commit hooks that split and concatenate the big file automatically. Would that be possible?

Other ideas?

Edit: I am aware of the 100 MB file size limitation described in similar questions here on Stack Overflow, but I don't consider my question a duplicate because I'm asking about the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). However, my understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be the perfect fit for the kind of file I'm describing, except that GitHub has a file size restriction.

Update: I spent yesterday experimenting with creating a small cross-platform program that splits and joins files into smaller files using git hooks (a minimal sketch follows below). It kind of works, but it isn't really satisfactory. The big text file has to be excluded via .gitignore, which makes git unaware of whether it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"? Setting up a cron job (Linux) or scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on users' computers, and just isn't a very elegant solution. Some hacky workarounds, like dynamically modifying .gitignore, might also be needed, and in no way would you get a diff of the actual text file, only of the split files (although that might be acceptable, as they would be very similar).
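To illustrate the idea, here is a minimal shell sketch of such a hook pair (my actual experiment used a small golang program; the file name lexicon.txt and the 50 MB chunk size are hypothetical):

#!/bin/sh
# pre-commit hook (sketch): split the big file, which is excluded via .gitignore,
# into chunks below GitHub's 100 MB limit, and stage the chunks instead.
split -b 50M lexicon.txt lexicon.part.
git add lexicon.part.*

#!/bin/sh
# post-checkout / post-merge hook (sketch): rebuild the big file from its chunks.
cat lexicon.part.* > lexicon.txt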

So, having slept on it, today I think the git hook approach is not a good option after all, as it has too many quirks. As @PyRulez suggested, I'll have to look at services other than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable, to avoid having to manage our own server. I'd also like it to be publicly available...

Update 2: I've looked at some alternatives to GitHub, and currently I'm leaning towards GitLab. I've contacted GitHub support about the possibility of raising the 100 MB limit, but if they won't do that, I'll just switch to GitLab for this particular project.

Joellyn answered 11/1, 2016 at 14:21 Comment(12)
Possible duplicate of not able to push file more than 100mb to git hubTrilby
@Trilby I know this sounds similar to other questions, but this question regards the specific case where I have a text file which has frequent but small diffs and if that makes it possible to work around the 100 MB limitation somehow. I understand binaries would not be possible.Joellyn
I guess I did not understand the question well, already answered, sorry :)Trilby
No problem :), I should have been clearer.Joellyn
Maybe use something besides GitHub?Starry
@PyRulez I'm open to other suggestions if you know about other git services that allow me to track a 190 MB text file (although I kinda like having our Windows users use GitHub Desktop).Joellyn
@Joellyn Dropbox (look up Dropbox+git)Starry
@Joellyn really, any file sharing thing would work (owncloud, bit torrent sync, etc...)Starry
@Joellyn Oh, and GitHub for Windows and Mac apparently works with any git repo, not just GitHub (according to this), so you could have a git+Dropbox+GitHub Desktop workflow!!! This link explains how.Starry
@PyRulez yeah, that seems pretty cool. It won't work with pull requests etc., but I think we can do fine without that feature. Using Dropbox with git has some downsides. As pointed out here, it can cause synchronization errors if multiple users try to push at the same time (we are a small team, but I'd like to avoid it anyway). Also, it's not straightforward to make the repo public using Dropbox, I think. The best would be to find a hosted git service which allows bigger files, I thinkJoellyn
@Joellyn I actually might have another solution (clean and smudge filters). If I have time, I'll write up an answer.Starry
@Joellyn (As a side note, Dropbox makes it very easy to be a public repo. The only problem is the 2GB free limit (which will include your history.))Starry

Clean and Smudge

You can use clean and smudge filters to compress your file. Normally this isn't necessary, since git will compress it internally, but since GitHub is acting weird, it may help. The main commands would look like:

git config filter.compress.clean gzip
git config filter.compress.smudge "gzip -d"

GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.
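For the filter to actually apply, the file also has to be mapped to it in .gitattributes. A minimal sketch, assuming the tracked file is named lexicon.txt (a hypothetical name):

# .gitattributes: route the big file through the "compress" filter defined above
lexicon.txt filter=compress

Note that git config settings are per-clone, so every collaborator has to run the two filter commands themselves; only the .gitattributes file travels with the repository.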

See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.

Alternatively, you could have clean upload the file to an online pastebin such as http://pastebin.com/, and have smudge fetch it back. Many other combinations are possible with clean and smudge.

Starry answered 14/1, 2016 at 2:19 Comment(9)
Interesting solution, thanks! This might make the 190MB smaller than 100MB. I assume the gzipped files won't be diffable though so each time the file changes, a new file would be created. If gzip compresses from 190MB to maybe 50MB, that's still 50 new MB for every commit.Joellyn
...maybe if instead of gzipping, the files could be split as I attempted with git hooks earlier. I'm currently leaning towards switching to GitLab instead of GitHub though, so I'll let that be a future experiment.Joellyn
@Joellyn see git-scm.com/book/en/v2/… for how to properly diff them.Starry
@Joellyn also, GitHub for Windows should work with GitLab, in case you were wondering. (Git is awesome.)Starry
Interesting! Thanks :). Git is indeed awesome.Joellyn
@Joellyn git-scm.com/docs/gitattributes has more in-depth materials for this answer.Starry
+1 This is an absolutely brilliant answer! I had only one file clocking in at 116MB. I added the two filters and then named the single file I needed compressed in .gitattributes. Elegant!Tyner
@pyrulez can you provide a little more info on what you add to the .gitattributes file?Injector
You should use gzip --rsyncable so that the resulting binary files are more amenable to binary diffing to reduce the size of the repository.Chihli

A very good solution is to use:

https://git-lfs.github.com/

It's an open-source extension designed to work with large files.
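For anyone trying this, the typical setup is only a few commands (the file name below is illustrative):

git lfs install
git lfs track "lexicon.txt"
git add .gitattributes lexicon.txt
git commit -m "Track the lexicon file with Git LFS"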

Directly answered 11/1, 2016 at 20:23 Comment(3)
Yes, I've tried it, but I make changes to the text file frequently so it would create a new 190MB file in LFS very often. As I understand LFS, it's best for files that rarely change.Joellyn
I agree git-lfs in GitHub works well. The issue I ran into is that it has a bandwidth limit, which for an enterprise system will quickly be exceeded and/or become very expensive. Not only do they charge for storing the file, but you also pay bandwidth every time a developer pulls down your LFS repo. It's even worse if you have a CI system: imagine a build that includes a 300 MB binary and 1300 builds before a release; every build pulls down that Git LFS repo. You end up with GitHub becoming a bit expensive.Muhammad
Nice, this was exactly what I was looking for!Everick

You can create a script/program in any language to split and rejoin files.

Here is an example, written in Java, that splits a file (I used Java because I'm more comfortable with it than with other languages, but any language would work, and some would be a better fit than Java).

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSplitter
{
    public static void main(String[] args) throws Exception
    {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; //from user input, extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize/numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; //8KB
        for(int destIx=1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+destIx));
            if(bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit/maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for(int i=0; i<numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if(numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        // Any bytes left over after the even split go into one final chunk.
        if(remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+(numSplits+1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    // Copies numBytes from the source file to the current chunk.
    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        int val = raf.read(buf);
        if(val != -1) {
            bw.write(buf, 0, val); // write only the bytes actually read
        }
    }
}

This will cost almost nothing (time/money).

Edit: You can create a Java executable and add it to your repository, or, even easier, create a Python (or any other language) script to do this and keep it as plain text in your repository.

Trilby answered 11/1, 2016 at 14:55 Comment(3)
Thanks! Do you know if it would be possible to automatically run this before committing and automatically merge after checking out?Joellyn
Check out the Unix/Linux split and cat commands. split -b 100M big-file big-file- ... cat big-file-* > big-fileJoke
@KeithThompson thanks. I knew about those but discarded the idea since I wanted it to work in Windows as well. However, it seems that git runs its git hooks in a bash environment even in Windows, so those commands might work there as well, I'm not sure. They would definitely be much simpler than implementing something myself (I created a small program in golang for testing).Joellyn
