I'm looking for implementations of set reconciliation algorithm. The problem is following: there are two sets with elements identified by some relatively compact value (e.g. UUID or MD5/SHA1/whatever hash) sitting on different machines. These sets differ in relatively few elements and I want to synchronize these sets while transferring minimal amount of data. Most of googling leads here. This is GPL'd implementation of what seems to be the state-of-art approach to the task. The problem is that I can't use GPL'd code in my app. Most likely I'll have to reimplement it myself using something like nzmath, but maybe there are other implementations (preferably Python or C/C++), or maybe there are other nicer algorithms?
Implementation of set reconciliation algorithm
Asked Answered
Couldn't you just put the keys in a file (sorted) and rsync it ? –
Positive
tonfa: it's one of possible solutions, but it doesn't exploit the fact that the ordering of data doesn't matter. Also, added/removed values will be evenly distributed across the file, causing rsync algorithm to transfer lots of excess data (perhaps a block per such value). –
Sonnysonobuoy
At run-time, will there be a known bound on the possible size of differences? –
Tongs
How big are the sets you're working with? Do you really need to use the state of the art? Or could you just compute the set difference and transfer only those elements? –
Gwendolin
Btw it's not clear if you want to transmit the keys, or data attached to keys (I assumed the keys). A possible solution (approximation) would be to use a bloom filter: en.wikipedia.org/wiki/… –
Positive
lost-theory: one of things this will be used for is syncing of simple content addressable storage that may contain over million files keyed by their md5sum. That's about 16 Mb to be transferred both ways. And what if I want to do it over pay-per-byte GPRS connection? –
Sonnysonobuoy
tonfa: yes, I know about bloom filters, this is one of possibilities I consider to use if I don't have time to implement the polynomial algorithm. Another possibility is use something along the lines of database sync algorithm. –
Sonnysonobuoy
Not being able to use GPL is often a matter of abstraction; that is if it is the license you have problems with. So if you create a small GPL application (released under GPL) you can call this from your non-GPL application. Why re-invent the wheel?
Especially if you can use a python script which already exists: why not leverage it? Of course things are different if you can not expose the element reconsolidation algorithms.
I've considered doing this, but I somehow feel bad about such GPL workaround as resulting GPL'd application will be rather useless without it's unGPL'd wrapper. –
Sonnysonobuoy
it would still comply with GPL. Other people may re-use ideas, so it still fits the GPL ideas. It beats re-inventing the wheel imho. –
Meiny
This code is out of my head, and thus covered by whatever license applies for code samples in this site.
# given two finite sequences of unique and hashable data,
# return needed opcodes and data needed for reconciliation
def set_reconcile(src_seq, dst_seq):
"Return required operations to mutate src_seq into dst_seq"
src_set= set(src_seq) # no-op if already of type set
dst_set= set(dst_seq) # ditto
for item in src_set - dst_set:
yield 'delete', item
for item in dst_set - src_set:
yield 'create', item
Use as follows:
for opcode, datum in set_reconcile(machine1_stuff, machine2_stuff):
if opcode == 'create':
# act accordingly
elif opcode == 'delete':
# likewise
else:
raise RuntimeError, 'unexpected opcode'
Ok, but you don't have both sets on any of machines, that's the problem. –
Sonnysonobuoy
If you can't transfer the list of keys from one machine, how will you transfer the items? Please re-specify your objection (if it is one). You asked for non-GPL code to reconcile two sets, I didn't see a request for code that manages transfers to and fro, transferred items being either unique IDs or items. –
Veradi
The thing is, if you have two LARGE sets of keys such as UUIDs sitting on different machines, a trick exists that allows one to get e.g. union of these sets on both sides without passing too much data (see link in the question). What happens after full key lists are obtained on both sides is trivial and thus of little interest for me. –
Sonnysonobuoy
The Synchronizing Keyserver project implements efficient set reconciliation in OCaml.
Do you if the implementation of the protocotol can be found in anything non OCaml? –
Heimlich
I have a freely-available C++ version at nislab.bu.edu. –
Convent
© 2022 - 2024 — McMap. All rights reserved.