I can't figure out why small differences to big files are causing my subversion repository to grow so much.
I have a zip file of the contents a database used by some tests. I want to store each new version of the test data in our subversion repository.
I've done some experiments, checking in the last few versions of the data.zip and looking at what happens to the size of the repository. The uncompressed data is about 150MB, compressed and zipped it's ~50MB. Each new version of the data.zip file checked into the repository increases the repository's size by about 50MB. I think it should only increase by the amount of a delta which I expect to be much less.
Subversion uses xdelta to store compressed difference data. My attempt to confirm that SVN could do better was to download xdelta and check there isn't much difference between two versions. Indeed
xdelta3.0z.x86-64.exe -e -s v1_path\data.zip v2_path\data.zip v1v2_delta.file
produced a v1v2_delta.file which was about 3MB.
I've looked in the SVN repository at [myrepo]\db\revs and can see large files for each new revision
02/08/2011 11:12 57,853,082 4189
02/08/2011 11:40 51,713,289 4190
02/08/2011 11:46 52,286,060 4191
(The 4189, 4190 and 4191 are the names of files.)
I even tried zipping the data.zip without compression. This didn't make a difference to what SVN stores - from the look of it, my guess is that it is storing a compressed copy of the entire data.zip for every revision, not just the first. I'm running SVN 1.6 with an FSFS backend.
There are various other good stackoverflow answers about committing binaries and how SVN stores deltas, e.g. SVN performance after many revisions. But I cannot see from these why deltas aren't being stored in the above case - ie. if xdelta can get such a small diff running standalone, surely SVN can too - or is it choosing not to?!
Edit: I've also tried tar (uncompressed) files, again SVN isn't storing them efficiently. Also I found that we have a zip file of the same data format (although much smaller) in a different repository where SVN has just stored diffs.
So the summarized version of this question is: SVN can efficiently store binary files, e.g. 10 slightly different CAD files are just 1.2 times the size of 1. SVN even can be space efficient with compressed zip files sometimes. But evidently it isn't always space efficient with binary files - under what conditions is this the case?