Update data only by difference between files (delta for java)
Asked Answered
W

4

6

UPDATE: I solved the problem with a great external library - https://code.google.com/p/xdeltaencoder/. The way I did it is posted below as the accepted answer

Imagine I have two separate pcs who both have an identical byte[] A.

One of the pcs creates byte[] B, which is almost identical to byte[] A but is a 'newer' version.

For the second pc to update his copy of byte[] A into the latest version (byte[] B), I need to transmit the whole byte[] B to the second pc. If byte[] B is many GB's in size, this will take too long.

Is it possible to create a byte[] C that is the 'difference between' byte[] A and byte[] B? The requirements for byte[] C is that knowing byte[] A, it is possible to create byte[] B.

That way, I will only need to transmit byte[] C to the second PC, which in theory would be only a fraction of the size of byte[] B.

I am looking for a solution to this problem in Java.

Thankyou very much for any help you can provide :)

EDIT: The nature of the updates to the data in most circumstances is extra bytes being inserted into parts of the array. Ofcourse it is possible that some bytes will be changed or some bytes deleted. the byte[] itself represents a tree of the names of all the files/folders on a target pc. the byte[] is originally created by creating a tree of custom objects, marshalling them with JSON, and then compressing that data with a zip algorithm. I am struggling to create an algorithm that can intelligently create object c.

EDIT 2: Thankyou so much for all the help everyone here has given, and I am sorry for not being active for such a long time. I'm most probably going to try to get an external library to do the delta-encoding for me. A great part about this thread is that I now know what I want to achieve is called! I believe that when I find an appropriate solution I will post it and accept it so others can see as to how I solved my problem. Once again, thankyou very much for all your help.

Whiteheaded answered 24/1, 2014 at 11:1 Comment(4)
Seems like you'd just need to transmit a collection of (index,newValue) pairs which would be smaller than the full list assuming a small number of changes. Must the transmission be byte[], the main problem with that is that a byte cannot hold numbers large enough to hold the high indexes. Could you use a serialised collection to transmit?Epiglottis
So, you want to create a delta between A and B. The problem is figuring out how to map position x in C to A. That is, the delta (C), would only be the length of the number bytes that have changed, but how do you then map that back to the correct position in A. You could seed the position information into C, so that x would be the position and x+1 would be the actual data...for example. You would then need to consider run length encoding (so you would have x = postion, x+1 is length, then x+2+length is the data...Or you could run it all through a ZipStreamAdelaadelaida
@RichardTingle That's a very good point about byte not being big enough, didn't think about that...you could do some bit shifting (is that the right term) and use 4/8 bytes per position or some such...about here I'm thinking there has to be an easier way to do this...Adelaadelaida
Well written question.Wesla
W
1

So, what I ended up doing was using this:

https://code.google.com/p/xdeltaencoder/

From my test it works really really well. However, you will need to make sure to checksum the source (in my case fileAJson), as it does not do it automatically for you!

Anyways, code below:

//Create delta
String[] deltaArgs = new String[]{fileAJson.getAbsolutePath(), fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Apply delta
deltaArgs = new String[]{"-d", fileAJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Trivia, Surpisingly this also works
deltaArgs = new String[]{"-d", fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
Whiteheaded answered 9/2, 2014 at 2:30 Comment(0)
E
3

Using a collection of "change events" rather than sending the whole array

A solution to this would be to send a serialized object describing the change rather than the actual array all over again.

public class ChangePair implements Serializable{
    //glorified struct
    public final int index;
    public final  byte newValue;

    public ChangePair(int index, byte newValue) {
        this.index = index;
        this.newValue = newValue;
    }

    public static void main(String[] args){

        Collection<ChangePair> changes=new HashSet<ChangePair>();

        changes.add(new ChangePair(12,(byte)2));
        changes.add(new ChangePair(1206,(byte)3));

    }
}

Generating the "change events"

The most efficient method for achieving this would be to track changes as you go, but assuming thats not possible you can just brute force your way through, finding which values are different

public static Collection<ChangePair> generateChangeCollection(byte[] oldValues, byte[] newValues){
    //validation
    if (oldValues.length!=newValues.length){
        throw new RuntimeException("new and old arrays are differing lengths");
    }

    Collection<ChangePair> changes=new HashSet<ChangePair>();

    for(int i=0;i<oldValues.length;i++){
        if (oldValues[i]!=newValues[i]){
            //generate a change event
            changes.add(new ChangePair(i,newValues[i]));
        }
    }

    return changes;
}

Sending and recieving those change events

As per this answer regarding sending serialized objects over the internet you could then send your object using the following code

Collection<ChangePair> changes=generateChangeCollection(oldValues,newValues);

Socket s = new Socket("yourhostname", 1234);
ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
out.writeObject(objectToSend);
out.flush();

On the other end you would recieve the object

ServerSocket server = new ServerSocket(1234);
Socket s = server.accept();
ObjectInputStream in = new ObjectInputStream(s.getInputStream());
Collection<ChangePair> objectReceived = (Collection<ChangePair>) in.readObject();
//use Collection<ChangePair> to apply changes

Using those change events

This collection can then simply be used to modify the array of bytes on the other end

public static void useChangeCollection(byte[] oldValues, Collection<ChangePair> changeEvents){
    for(ChangePair changePair:changeEvents){
        oldValues[changePair.index]=changePair.newValue;
    }
}   
Epiglottis answered 24/1, 2014 at 11:26 Comment(5)
Any ideas on an algorithm that could determine change pair objects?Whiteheaded
@Whiteheaded assuming you can't generate them as you're making the changes to the first array; brute forcing your way through the array looking for changes between the two is likely to be faster than you imagineEpiglottis
However, I do not know how to create an algorithm that could brute force its way through the array looking for changes. This is why I am looking for help :DWhiteheaded
@Whiteheaded Ah fair enough, its 2am here now but I'll see what I can do tomorrowEpiglottis
@Whiteheaded Have added a couple of sections, "Generating the 'change events'" and "Using those change events" that should cover those issuesEpiglottis
H
1

Locally log the changes to the byte array, like a little version control system. In fact you could use a VCS to create patch files, send them to the other side and apply them to get an up-to-date file;

If you cannot log changes, you would need to double the array locally, or (not so 100% safe) use an array of checksums on blocks.

Hymie answered 24/1, 2014 at 11:20 Comment(4)
Unfortunately in my case the new byte[] is created from scratch. D: A VCS sounds very nice, though I am not sure as to how I would implement such a thing in Java. Are there any libraries that can help?Whiteheaded
Some like JGit (git) and SVNKit. The question is whether a VCS fits: you seem to have one huge binary "file".Hymie
Yes, that is correct, I do have one huge binary file that I need to update periodically. I don't need to keep track of changes though. Wouldn't a VCS's be slightly overkill for my purposes?Whiteheaded
Yes, a VCS has some ballast. The Unix commands diff to create a diff file and patch to apply a diff file would be interesting for text. Depending on the kind of changes a binary version might be of interest.Hymie
C
1

The main problem here is data compression.

Kamikaze offers you good compression algorithms for data arrays. It uses Simple16 and PForDelta coding. Simple16 is a good and (as the name says) simple list compression option. Or you can use Run Lenght Encoding. Or you can experiment with any compression algorithm you have available in Java...

Anyway, any method you use will be optimized if you first preprocess the data.

You can reduce the data calculating differences or, as @RichardTingle pointed, creating pairs of different data locations.

You can calculate C as B - A. A will have to be an int array, since the difference between two byte values can be higher than 255. You can then restore B as A + C.

The advantage of combining at least two methods here is that you get much better results.

E.g. if you use the difference method with A = { 1, 2, 3, 4, 5, 6, 7 } and B = { 1, 2, 3, 5, 6, 7, 7 }. The difference array C will be { 0, 0, 0, 1, 1, 1, 0 }. RLE can compress C in a very effective way, since it is good for compressing data when you have many repeated numbers in sequence.

Using the difference method with Simple16 will be good if your data changes in almost every position, but the difference between values is small. It can compress an array of 28 single-bit values (0 or 1) or an array of 14 two-bit values to a single 32-byte integer.

Experiment, it all will depend on how your data behaves. And compare the data compression ratios for each experiment.


EDIT: You will have to preprocess the data before JSON and zip compressing.

Create two sets old and now. The latter contains all files that exists now. For the former, the old files, you have at least two options:

  • Should contain all files that existed before you sent them to the other PC. You will need to keep a set of what the other PC knows to calculate what has changed since the last synchronization, and send only the new data.

  • Contains all files since you last checked for changes. You can keep a local history of changes and give each version an "id". Then, when you sync, you send the "version id" together with the changed data to the other PC. Next time, the other PC first sends its "version id" (or you keed the "version id" of each PC locally), then you can send the other PC all the new changes (all the versions that come after the one that PC had).

The changes can be represented by two other sets: newFiles, and deleted files. (What about files that changed in content? Don't you need to sync these too?) The newFiles contains the ones that only exist in set now (and do not exist in old). The deleted set contains the files that only exist in set old (and do not exist in now).

If you represent each file as an String with the full pathname, you safely will have unique representations of each file. Or you can use java.io.File.

After you reduced your changes to newFiles and deleted files set, you can convert them to JSON, zip and do anything else to serialize and compress the data.

Carryingon answered 24/1, 2014 at 11:53 Comment(5)
Neat idea, but imagine this. A = [10,5,7,9,26]. B = [10,2,5,7,9,26]. Inserting just one byte earlier on in the array will screw over all the data that follows. With this data set C = [0,-3,-2,-2,-17,26], which is not useful at all. A more intelligent algorithm would be able to determine that the data has simply moved one space to the right (something along the lines of [insert 2 at index 1]. I am having trouble writing such an algorithm though. Any ideas?Whiteheaded
@SuperMrBlob, can the values be repeated in the list or not (as an ordered set)? It would help if you describe all the operation that can be performed to change the list (as this: inserting an element at any position).Carryingon
@SuperMrBlob, can we know what exactly these bytes are representing?Carryingon
See the edited first post for details. The byte[] just represents the file/directory structure on the target computer. Kinda like when you go to cmd and type in dir /s. Thus it is largely expected for it not to change greatly, but operation like adding a new file, deleting a file or changing a files name will almost always occur. The values can definitely be repeated - if a folder has the same name than it is likely that they will be repeatedWhiteheaded
@SuperMrBlob, I edited my answer based on the new information you gave. In fact, you should not care about the byte[] because that is compressed data already and should not be changed directly (it can become invalid compressed data). You should care about what you can do before compressing it!Carryingon
W
1

So, what I ended up doing was using this:

https://code.google.com/p/xdeltaencoder/

From my test it works really really well. However, you will need to make sure to checksum the source (in my case fileAJson), as it does not do it automatically for you!

Anyways, code below:

//Create delta
String[] deltaArgs = new String[]{fileAJson.getAbsolutePath(), fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Apply delta
deltaArgs = new String[]{"-d", fileAJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Trivia, Surpisingly this also works
deltaArgs = new String[]{"-d", fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);
Whiteheaded answered 9/2, 2014 at 2:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.