Are there any good workarounds to the GitHub 100MB file size limit for text files?

I have a 190 MB plain text file that I want to track on GitHub.

The text file is a pronunciation lexicon file for our text-to-speech engine. We regularly add and modify lines in it, and the diffs are fairly small, so it's perfect for git in that sense.

However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.

I would like to keep the file as a single file rather than splitting it, because that's how our workflow currently works, and supporting multiple text files as input/output in our tools would require some coding (and we don't have many development resources).

One idea I've had is to set up pre- and post-commit hooks that split and concatenate the big file automatically. Would that be possible?

Other ideas?

Edit: I am aware of the 100 MB file size limitation described in similar questions here on Stack Overflow, but I don't consider my question a duplicate because I'm asking about the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). However, my understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be the perfect fit for the kind of file I'm describing, except that GitHub has a file size restriction.

Update: I spent yesterday experimenting with creating a small cross-platform program that splits and joins files into smaller files using git hooks (a minimal sketch follows below). It kind of works, but it isn't really satisfactory. The big text file has to be excluded via .gitignore, which makes git unaware of whether it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"? Setting up a cron job (Linux) or scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on users' computers, and just isn't a very elegant solution. Some hacky workarounds, like dynamically modifying .gitignore, might also be needed, and in no way would you get a diff of the actual text file, only of the split files (although that might be acceptable, as they would be very similar).
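To illustrate the idea, here is a minimal shell sketch of such a hook pair (my actual experiment used a small golang program; the file name lexicon.txt and the 50 MB chunk size are hypothetical):

#!/bin/sh
# pre-commit hook (sketch): split the big file, which is excluded via .gitignore,
# into chunks below GitHub's 100 MB limit, and stage the chunks instead.
split -b 50M lexicon.txt lexicon.part.
git add lexicon.part.*

#!/bin/sh
# post-checkout / post-merge hook (sketch): rebuild the big file from its chunks.
cat lexicon.part.* > lexicon.txt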

So, having slept on it, today I think the git hook approach is not a good option after all, as it has too many quirks. As @PyRulez suggested, I'll have to look at services other than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable, to avoid having to manage our own server. I'd also like it to be publicly available...

Update 2: I've looked at some alternatives to GitHub, and currently I'm leaning towards GitLab. I've contacted GitHub support about the possibility of raising the 100 MB limit, but if they won't do that, I'll just switch to GitLab for this particular project.

Joellyn answered 11/1, 2016 at 14:21 Comment(12)
Possible duplicate of not able to push file more than 100mb to git hubTrilby
@Trilby I know this sounds similar to other questions, but this question regards the specific case where I have a text file which has frequent but small diffs and if that makes it possible to work around the 100 MB limitation somehow. I understand binaries would not be possible.Joellyn
I guess I did not understand the question well, already answered, sorry :)Trilby
No problem :), I should have been clearer.Joellyn
Maybe use something besides GitHub?Starry
@PyRulez I'm open to other suggestions if you know about other git services that allow me to track a 190 MB text file (although I kinda like having our Windows users use GitHub Desktop).Joellyn
@Joellyn Dropbox (look up Dropbox+git)Starry
@Joellyn really, any file sharing thing would work (owncloud, bit torrent sync, etc...)Starry
@Joellyn Oh, and GitHub for Windows and Mac apparently works with any git repo, not just GitHub (according to this), so you could have a git+Dropbox+GitHub Desktop workflow!!! This link explains how.Starry
@PyRulez yeah, that seems pretty cool. It won't work with pull requests etc., but I think we can do fine without that feature. Using Dropbox with git has some downsides. As pointed out here, it can cause synchronization errors if multiple users try to push at the same time (we are a small team, but I'd like to avoid it anyway). Also, it's not straightforward to make the repo public using Dropbox, I think. The best would be to find a hosted git service which allows bigger files, I thinkJoellyn
@Joellyn I actually might have another solution (clean and smudge filters). If I have time, I'll write up an answer.Starry
@Joellyn (As a side note, Dropbox makes it very easy to be a public repo. The only problem is the 2GB free limit (which will include your history.))Starry

Clean and Smudge

You can use clean and smudge filters to compress your file. Normally this isn't necessary, since git will compress it internally, but since GitHub is acting weird, it may help. The main commands would look like:

git config filter.compress.clean gzip
git config filter.compress.smudge "gzip -d"

GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.
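For the filter to actually apply, the file also has to be mapped to it in .gitattributes. A minimal sketch, assuming the tracked file is named lexicon.txt (a hypothetical name):

# .gitattributes: route the big file through the "compress" filter defined above
lexicon.txt filter=compress

Note that git config settings are per-clone, so every collaborator has to run the two filter commands themselves; only the .gitattributes file travels with the repository.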

See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.

Alternatively, you could have clean upload the file to an online pastebin such as http://pastebin.com/, and have smudge fetch it back. Many other combinations are possible with clean and smudge.

Starry answered 14/1, 2016 at 2:19 Comment(9)
Interesting solution, thanks! This might make the 190MB smaller than 100MB. I assume the gzipped files won't be diffable though so each time the file changes, a new file would be created. If gzip compresses from 190MB to maybe 50MB, that's still 50 new MB for every commit.Joellyn
...maybe if instead of gzipping, the files could be split as I attempted with git hooks earlier. I'm currently leaning towards switching to GitLab instead of GitHub though, so I'll let that be a future experiment.Joellyn
@Joellyn see git-scm.com/book/en/v2/… for how to properly diff them.Starry
@Joellyn also, GitHub for Windows should work with GitLab, in case you were wondering. (Git is awesome.)Starry
Interesting! Thanks :). Git is indeed awesome.Joellyn
@Joellyn git-scm.com/docs/gitattributes has more in-depth materials for this answer.Starry
+1 This is an absolutely brilliant answer! I had only one file clocking in at 116MB. I added the two filters and then named the single file I needed compressed in .gitattributes. Elegant!Tyner
@pyrulez can you provide a little more info on what you add to the .gitattributes file?Injector
You should use gzip --rsyncable so that the resulting binary files are more amenable to binary diffing to reduce the size of the repository.Chihli

A very good solution is to use:

https://git-lfs.github.com/

It's an open-source extension designed to work with large files.
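For anyone trying this, the typical setup is only a few commands (the file name below is illustrative):

git lfs install
git lfs track "lexicon.txt"
git add .gitattributes lexicon.txt
git commit -m "Track the lexicon file with Git LFS"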

Directly answered 11/1, 2016 at 20:23 Comment(3)
Yes, I've tried it, but I make changes to the text file frequently so it would create a new 190MB file in LFS very often. As I understand LFS, it's best for files that rarely change.Joellyn
I agree git-lfs in GitHub works well. The issue I ran into is that it has a bandwidth limit, which for an enterprise system will quickly be exceeded and/or become very expensive. Not only do they charge for storing the file, but you also pay bandwidth every time a developer pulls down your LFS repo. It's even worse if you have a CI system: imagine a build that includes a 300 MB binary and 1300 builds before a release; every build pulls down that Git LFS repo. You end up with GitHub becoming a bit expensive.Muhammad
Nice, this was exactly what I was looking for!Everick

You can create a script/program in any language to split and rejoin files.

Here is an example, written in Java, that splits a file (I used Java because I'm more comfortable with it than with other languages, but any language would work, and some would be a better fit than Java).

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSplitter
{
    public static void main(String[] args) throws Exception
    {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; //from user input, extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize/numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; //8KB
        for(int destIx=1; destIx <= numSplits; destIx++) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+destIx));
            if(bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit/maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for(int i=0; i<numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if(numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        // Any bytes left over after the even split go into one final chunk.
        if(remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split."+(numSplits+1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    // Copies numBytes from the source file to the current chunk.
    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        int val = raf.read(buf);
        if(val != -1) {
            bw.write(buf, 0, val); // write only the bytes actually read
        }
    }
}

This will cost almost nothing (time/money).

Edit: You can create a Java executable and add it to your repository, or, even easier, create a Python (or any other language) script to do this and keep it as plain text in your repository.

Trilby answered 11/1, 2016 at 14:55 Comment(3)
Thanks! Do you know if it would be possible to automatically run this before committing and automatically merge after checking out?Joellyn
Check out the Unix/Linux split and cat commands. split -b 100M big-file big-file- ... cat big-file-* > big-fileJoke
@KeithThompson thanks. I knew about those but discarded the idea since I wanted it to work in Windows as well. However, it seems that git runs its git hooks in a bash environment even in Windows, so those commands might work there as well, I'm not sure. They would definitely be much simpler than implementing something myself (I created a small program in golang for testing).Joellyn
