Binary Delta Storage

I'm looking for a binary delta storage solution to version large binary files (digital audio workstation files).

When working with DAW files, the majority of changes, especially near the end of the mix, are very small in comparison to the huge amount of data used to store the raw audio (waves).

It would be great to have a versioning system for our DAW files, allowing us to roll back to older versions.

The system would only save the difference between the binary files (diff) of each version. This would give us a list of instructions to change from the current version to the previous version without storing the full file for every single version.

Are there any current versioning systems that do this? I've read that SVN uses binary diffs to save space in the repo... but I've also read that it only actually does that for text files, not binary files... Not sure. Any ideas?

My plan of action right now is to continue researching pre-existing tools, and if none exist, to get comfortable reading binary data in C/C++ and write the tool myself.
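
Before writing a tool from scratch, it might be worth scripting an existing delta encoder as a first pass. A rough sketch using xdelta3 and librsync's rdiff (assuming those tools cope acceptably with files of this size; the file names are just placeholders):

# create a delta that turns version 1 into version 2
$ xdelta3 -e -s project_v1.reason project_v2.reason v1_to_v2.xdelta
# later, rebuild version 2 from version 1 plus the delta
$ xdelta3 -d -s project_v1.reason v1_to_v2.xdelta project_v2_restored.reason

# rdiff does the same job via an intermediate signature file
$ rdiff signature project_v1.reason v1.sig
$ rdiff delta v1.sig project_v2.reason v1_to_v2.rdiff
$ rdiff patch project_v1.reason v1_to_v2.rdiff project_v2_restored.reason

Keeping the newest file in full plus a chain of reverse deltas, as described above, would mean older versions are rebuilt by applying the deltas in sequence.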

Exegesis answered 29/8, 2011 at 18:34 Comment(2)
Please don't repeat the same question on our site. Thanks.Hagood
The repeated question was accidental, due to (I think) a bug. I attempted to press "add question" a single time, but it gave me an error saying I needed to wait 20 minutes before submitting. Afterwards I submitted again, only to see two questions rather than one...Exegesis

I can't comment on the reliability or connection issues that might exist when committing a large file across the network (one referenced post hinted at problems). But here is a little bit of empirical data that you may find useful (or not).

I have been doing some tests today studying disk seek times and so had a reasonably good test case readily at hand. I found your question interesting, so I did a quick test with the files I am using/modifying. I created a local Subversion repository and added two binary files to it (sizes shown below) and then committed the files a couple of times after changes were made to them. The smaller binary file (0.85 GB) simply had data added to the end of it each time. The larger file (2.2 GB) contains data representing b-trees consisting of "random" integer data. The updates to that file between commits involved adding approximately 4000 new random values, so it would have modified nodes spread somewhat evenly throughout the file.

Here are the original file sizes along with the size/count of all files in the local subversion repository after the commit:

file1    851,271,675  
file2  2,205,798,400 

1,892,512,437 bytes in 32 files and 32 dirs

After the second commit:

file1    851,287,155  
file2  2,207,569,920  

1,894,211,472 bytes in 34 files and 32 dirs

After the third commit:

file1    851,308,845  
file2  2,210,174,976  

1,897,510,389 bytes in 36 files and 32 dirs

The commits were somewhat lengthy. I didn't pay close attention because I was doing other work, but I think each one took maybe 10 minutes. Checking out a specific revision took about 5 minutes. I would not make a recommendation one way or the other based on my results. All I can say is that it seemed to work fine and no errors occurred. And the file differencing seemed to work well (for these files).
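
For anyone who wants to repeat this kind of test with their own files, the local-repository setup only takes a few commands. A minimal sketch (paths and file names are placeholders):

# create an empty local repository and check out a working copy of it
$ svnadmin create /tmp/testrepo
$ svn checkout file:///tmp/testrepo /tmp/testwc
$ cd /tmp/testwc

# add the large binary files and commit
$ cp /path/to/file1 /path/to/file2 .
$ svn add file1 file2
$ svn commit -m "initial import"

# modify the files, commit again, and watch the repository size
$ svn commit -m "second revision"
$ du -sh /tmp/testrepo

# bring the working copy back to an older revision
$ svn update -r 1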

Burgas answered 30/8, 2011 at 17:23 Comment(9)
Yes, that is about what you would expect from a binary delta so that seems to be working quite well... Hmm. I think I will have to try to do the same test on my own files with a local repo.Exegesis
@Colton: Out of curiosity, I used 7-Zip (a file compression utility) with the default settings and compressed those two files. It resulted in a 1.88 GB file, so the compression used by Subversion in this case seems to be in the same ballpark. They probably both use zlib.Burgas
@Colton: What DAW software are you using? Not that it really matters, but I'm curious. I use Cakewalk (Sonar) and have thought that I should use some kind of version control with my files but have never actually done it. This little test I did makes me think I may set it up at home.Burgas
I'm using Reason as a DAW. Keep me posted on what you do and how it works :)Exegesis
I'm guessing that since the actual delta is only a few megabytes the majority of the 10 minutes spent committing was probably doing the binary diff stuff.Exegesis
That is probably correct. It has to read the repository data and decompress it, read the current file, compare/diff them, make changes, compress again, etc. Quite a lot going on.Burgas
How long do you suppose the decompression and compression stages take, approximately? It seems to me that the only advantages my own implementation of a binary delta storage version control system would have over svn are that it would skip those two stages, and that a single-purpose tool might be slightly faster because it's less flexible than svn. Now, the question is: even if that time WAS significant, would it be worth writing that application anyway... probably not...Exegesis
That's also assuming that my algorithm would be as efficient as svn's, which it likely wouldn't be... but who knows.Exegesis
One thing to watch out for - depending on exactly what your binary data is, Subversion can get quite poor at creating deltas. My investigation into this is in this question. It seems one key thing to know about is the skip-deltas scheme, which means each delta isn't necessarily calculated against the immediately preceding version of the file.Fumigant

Subversion might work, depending on your definition of large. This question/answer says that it works well as long as your files are less than 1 GB.

Wsan answered 29/8, 2011 at 18:45 Comment(0)

Subversion will perform binary deltas on binary files as well as text files. Subversion is just incapable of providing human-readable deltas for binary files, and cannot assist with merging conflicts in binary files.
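
As a side note, whether Subversion treats a working-copy file as text or binary for diff and merge purposes is controlled by the svn:mime-type property; the repository-side deltification happens either way. A small sketch (the file name is just an illustration):

# see how Subversion currently classifies the file
$ svn propget svn:mime-type song.reason

# explicitly mark it as binary so no text merging is attempted
$ svn propset svn:mime-type application/octet-stream song.reason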

Narcosis answered 29/8, 2011 at 20:23 Comment(1)
I accidentally posted this thread twice... but on the other thread I created, someone said "Subversion might work, depending on your definition of large. This question/answer says that it works well as long as your files are less than 1 GB." Which is a problem, as almost all DAW files are going to be greater than a GB.Exegesis

git compresses (you may need to call git gc manually, though), and seemingly really well:

$ git init
$ dd if=/dev/urandom of=largefile bs=1M count=100
$ git add largefile
$ git commit -m 'first commit'
[master (root-commit) e474841] first commit
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 largefile
$ du -sh .
201M    .
$ for i in $(seq 20); do date >> largefile; git commit -m "$i" -a; git gc; done
$ du -sh .
201M    .
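
If you try this with your own DAW files, git count-objects shows how much space loose objects and pack files take, which gives a rough idea of how well the history compressed after git gc (a small follow-up to the transcript above):

# report object counts and disk usage of loose objects vs. pack files
$ git count-objects -v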
Fennell answered 30/8, 2011 at 17:37 Comment(5)
This will probably fail if you use git on a 32-bit OS.Barela
@yaruncan Can you elaborate on why you believe it would fail, and why the bitness of the OS should matter, of all things? I get the exact same output on a 32-bit system.Fennell
phihag, 64-bit vs 32-bit OS matters when you need a lot of RAM for these kinds of operations. I am mainly talking about git compression via things like git repack and git gc. In fact, these operations always fail on my 32-bit Linux, so I have to do them on another PC with a 64-bit operating system.Barela
@yaruncan I disagree; the bitness of the OS doesn't matter at all for that. What you mean is the available address space of processes, which is a different beast. If your repository is indeed large, some operations may not work. Note that this example with a 200 MB file works fine on a 32-bit system with less than 1 GiB of RAM, though. Also, newer git versions have optimized repack and gc significantly.Fennell
git packing and compression frequently fail on me no matter what packing limits I use. I was merely warning the poster about the dangers.Barela
