Inexplicable SVN repository size increase from small differences to big files

Asked 2/8, 2011 at 19:29 Answered 22/4, 2012 at 9:38

I can't figure out why small differences to big files are causing my subversion repository to grow so much.

I have a zip file of the contents of a database used by some tests. I want to store each new version of the test data in our subversion repository.

I've done some experiments, checking in the last few versions of the data.zip and looking at what happens to the size of the repository. The uncompressed data is about 150MB; compressed and zipped it's ~50MB. Each new version of the data.zip file checked into the repository increases the repository's size by about 50MB. I would expect it to grow only by the size of a delta, which should be much smaller.

Subversion uses xdelta to store compressed difference data. To confirm that SVN could do better, I downloaded xdelta and checked that there isn't much difference between two versions. Indeed, running

xdelta3.0z.x86-64.exe -e -s v1_path\data.zip v2_path\data.zip v1v2_delta.file

produced a v1v2_delta.file which was about 3MB.

I've looked in the SVN repository at [myrepo]\db\revs and can see large files for each new revision

02/08/2011  11:12        57,853,082 4189
02/08/2011  11:40        51,713,289 4190
02/08/2011  11:46        52,286,060 4191

(The 4189, 4190 and 4191 are the names of files.)

I even tried zipping the data.zip without compression. This didn't make a difference to what SVN stores - from the look of it, my guess is that it is storing a compressed copy of the entire data.zip for every revision, not just the first. I'm running SVN 1.6 with an FSFS backend.

There are various other good Stack Overflow answers about committing binaries and how SVN stores deltas, e.g. SVN performance after many revisions. But I cannot see from these why deltas aren't being stored in the above case, i.e. if xdelta can get such a small diff running standalone, surely SVN can too. Or is it choosing not to?!

Edit: I've also tried tar (uncompressed) files, again SVN isn't storing them efficiently. Also I found that we have a zip file of the same data format (although much smaller) in a different repository where SVN has just stored diffs.

So the summarized version of this question is: SVN can efficiently store binary files, e.g. 10 slightly different CAD files are just 1.2 times the size of 1. SVN even can be space efficient with compressed zip files sometimes. But evidently it isn't always space efficient with binary files - under what conditions is this the case?

Pacifist answered 2/8, 2011 at 19:29 Comment(1)

Regarding "avoid storing binary files". On Windows, this is unavoidable, especially if storing revisions of game-editor artifacts or office-based documents. "Avoid storing easily regenerable binary files" is more apt. The fact that svn can use binary deltas sets it apart from every other freely available source control system out there, as none of the others can do this -- they all recommit the binary fresh, which causes a large leap in the end size of the storage. – Macroscopic 24/1, 2012 at 19:32

Summary

Subversion will sometimes be worse than xdelta standalone because of how much memory is given to the compression. This is subversion behaviour that can't currently be changed, as of version 1.6.

Details

I asked on the subversion mailing list why the subversion repository files seemed to be bigger than they should be.

The conclusion is that xdelta can produce a smaller delta if you give it more memory.
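To make the memory point concrete, here is a toy sketch in Python. It is not Subversion's or xdelta's real algorithm. It assumes the memory limit behaves like a fixed window, so that a stretch of the new file can only be matched against the stretch of the old file at the same offsets. The function names, block size and window sizes are made up for illustration.

```python
import random


def random_bytes(rng: random.Random, n: int) -> bytes:
    """Return n pseudo-random bytes from the given generator."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def literal_bytes_needed(source: bytes, target: bytes, window: int, block: int = 16) -> int:
    """Count target bytes a toy windowed delta would have to store verbatim.

    Assumption: target[start:start+window] may only copy blocks found in
    source[start:start+window]. Everything else is stored as literal data.
    """
    literal = 0
    for start in range(0, len(target), window):
        src_view = source[start:start + window]
        # Index every block-sized substring of this source window.
        known = {src_view[i:i + block] for i in range(len(src_view) - block + 1)}
        tgt_view = target[start:start + window]
        for i in range(0, len(tgt_view), block):
            chunk = tgt_view[i:i + block]
            if chunk not in known:
                literal += len(chunk)
    return literal


if __name__ == "__main__":
    rng = random.Random(0)
    old = random_bytes(rng, 64 * 1024)
    # New version: 4 KB of fresh data inserted in front, the rest unchanged.
    new = random_bytes(rng, 4 * 1024) + old
    for window in (2 * 1024, len(new)):
        cost = literal_bytes_needed(old, new, window)
        print(f"window={window:>6} bytes -> literal bytes stored: {cost}")
```

In this model, an insertion that shifts the rest of the file by more than the window size leaves the small window with nothing to match against, while a window covering the whole file only has to store the inserted data.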

Reading back in this thread, there is another example of someone else who had the same problem.

With credit and thanks to various people on subversion mailing lists recently and four years ago for this.

Also having this problem?

If you're analysing disk usage by the subversion repository, understand skip deltas and use this grep DELTA trick to figure out the base being used for the delta.
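If grep isn't handy (e.g. on Windows), something like the Python sketch below could do the same scan. It assumes that an FSFS revision file marks each stored representation with a header line that is either a bare DELTA or DELTA followed by the revision, offset and length of the base. Check that assumption against your own repository format before trusting the output. Because the files are binary, the occasional false match is possible.

```python
import re
import sys
from pathlib import Path

# Assumed header shapes: "DELTA" on its own line (delta against nothing)
# or "DELTA <rev> <offset> <length>" naming the base representation.
DELTA_HEADER = re.compile(rb"^DELTA(?: (\d+) (\d+) (\d+))?$", re.MULTILINE)


def find_delta_bases(rev_bytes: bytes):
    """Return one entry per DELTA header found in the raw revision file.

    Each entry is None for a bare "DELTA" line, or a tuple
    (base_revision, base_offset, base_length) otherwise.
    """
    bases = []
    for match in DELTA_HEADER.finditer(rev_bytes):
        if match.group(1) is None:
            bases.append(None)
        else:
            bases.append(tuple(int(g) for g in match.groups()))
    return bases


if __name__ == "__main__":
    # Usage: python find_delta_bases.py <path to a file under db/revs>
    for base in find_delta_bases(Path(sys.argv[1]).read_bytes()):
        print("no base" if base is None else f"base rev {base[0]}, offset {base[1]}, length {base[2]}")
```

If a large revision file reports a base revision that is not the one you expected, that is where skip deltas come in.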

And assuming, like me, you really do want to store binary files in the repository, here's my guess at some workarounds (none of them very easy!):

Modify the subversion source code and build your own with the xdelta memory window set to be bigger
Do your own xdelta-ing - check the deltas into source control and have some crazy ass process for reconstructing
Migrate to Git - it's bound to have better compression (wild speculation)
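For the second workaround, the reconstruction process might look roughly like the Python sketch below. It only builds the command lines, using the same -e -s form as in the question plus xdelta3's -d flag for decoding. The executable name, file names and the idea of keeping one full copy plus a chain of deltas are my assumptions, not something I've run in anger.

```python
import subprocess

XDELTA = "xdelta3"  # assumption: adjust to whatever your xdelta binary is called


def encode_cmd(base: str, new: str, delta: str) -> list:
    """Command that writes `delta`, the difference from `base` to `new`."""
    return [XDELTA, "-e", "-s", base, new, delta]


def decode_cmd(base: str, delta: str, rebuilt: str) -> list:
    """Command that rebuilds a version from `base` plus `delta`."""
    return [XDELTA, "-d", "-s", base, delta, rebuilt]


def reconstruction_plan(full_copy: str, deltas: list, out_template: str = "rebuilt_{n}.zip") -> list:
    """Chain the deltas: v1 + d12 -> v2, then v2 + d23 -> v3, and so on."""
    plan = []
    current = full_copy
    for n, delta in enumerate(deltas, start=2):
        rebuilt = out_template.format(n=n)
        plan.append(decode_cmd(current, delta, rebuilt))
        current = rebuilt
    return plan


def run_plan(plan: list) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    for cmd in reconstruction_plan("data_v1.zip", ["v1v2_delta.file", "v2v3_delta.file"]):
        print(" ".join(cmd))
```

The obvious cost is that getting the latest data means replaying every delta since the last full copy, so you'd probably want to check in a fresh full copy every so often.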

Pacifist answered 9/8, 2011 at 19:8 Comment(0)

I would think that the compression will completely change the makeup of the binary file, therefore svn will have to store huge deltas. Even changing a few characters of the contents of a compressed file can drastically change it.

Storing binaries in source control is generally a bad idea and I think you should look for an alternative.

Porbeagle answered 2/8, 2011 at 19:39 Comment(4)

Re: compression completely changing the binary file - that's exactly what I was thinking, hence trying zipping without compression. But in any case, what I can't figure out is that when run standalone from the command line, xdelta manages to produce a small diff. Given SVN uses xdelta, surely it should also achieve a small diff? – Pacifist 3/8, 2011 at 12:49

What results do you see if you don't zip the database at all and just store it uncompressed? – Foot 7/8, 2011 at 22:45

In its raw format, the database data is a vast tree of files and folders. I can commit a first version of this. But to commit a second version, I can't easily create a working copy - I can't just drop the second version on top of the first, because this messes up all the .svn folders. Unless there's some trick someone knows?... – Pacifist 8/8, 2011 at 18:20

Saying that storing binaries in version control is a bad idea is bad advice in my opinion. One should put whatever one needs to version into version control. Not versioning binaries or having them versioned somewhere else can be just as bad an idea. In general I think it's much more helpful to have a tight connection between all your "source", no matter if it's text or otherwise, than to spread things out. Only consider alternatives if you have serious scaling problems, and you probably don't. Otherwise it's probably your version control that's to blame, not your practices. – Grumous 29/6, 2017 at 14:15

A compressed file's binary content might change drastically when files are added or modified in a compressed archive. It can happen, though, that changes are confined to particular elements of the archive and no significant changes occur in large areas of the compressed file. However, it is a matter of "luck" whether this will be the case (of course there is no real luck in this, but it is a bit complex to plan on achieving it).

This is quite normal in entropy encoding algorithms, such as Huffman (to name the simplest one), as the frequencies of the symbols change when files are added or modified. If this takes place at the beginning of the archive's contents, it can severely affect the entire content of the file following the change.
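One way to check this on your own data is to compress two nearly identical inputs and count how many byte positions of the output differ. The Python sketch below does that with zlib (DEFLATE, which combines LZ77 with Huffman coding). The sample data is invented, and the result for a real archive may look quite different, so treat it as a measuring tool and not as evidence either way.

```python
import zlib


def changed_fraction(a: bytes, b: bytes) -> float:
    """Fraction of byte positions (over the longer input) that differ."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return (longest - same) / longest


def sample_data(extra_first_record: bool) -> bytes:
    """Invented test data: many similar text records, optionally one more in front."""
    body = b"".join(b"record %06d value %d\n" % (i, (i * i) % 9973) for i in range(20000))
    prefix = b"record new value 1\n" if extra_first_record else b""
    return prefix + body


if __name__ == "__main__":
    v1 = sample_data(False)
    v2 = sample_data(True)  # same content with one short record added at the start
    c1, c2 = zlib.compress(v1), zlib.compress(v2)
    print(f"raw sizes:        {len(v1)} vs {len(v2)} bytes")
    print(f"compressed sizes: {len(c1)} vs {len(c2)} bytes")
    print(f"compressed byte positions that differ: {changed_fraction(c1, c2):.1%}")
```

A position-by-position comparison is crude, since a delta tool can also find matches that have moved. Even so, it gives a quick feel for how far a small change at the start spreads through the compressed output.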

Mecham answered 22/4, 2012 at 9:38 Comment(0)


Did you use the fsfs file system backing? As I recall, it stores a new copy each time (although it may be compressed). Why are you expecting SVN to store diffs of binary files? SVN is a source code control system (meaning text) not a general binary control system (although it doesn't do as badly as it could with storing binaries).

Knighthood answered 2/8, 2011 at 21:5 Comment(2)

Since Subversion 1.4 subversion.apache.org/docs/release-notes/1.4.html "Subversion uses the xdelta algorithm to compute differences between strings of bytes", i.e. binary files too. – Pacifist 3/8, 2011 at 13:14

Subversion uses deltas for everything. It doesn't know or care if the files are source or binary. It just does a delta against the previous repo revision (assuming FSFS). – Soerabaja 11/8, 2011 at 18:11
