Synchronize two ordered lists
We have two offline systems that normally cannot communicate with each other. Both systems maintain the same ordered list of items. Only rarely will they be able to communicate with each other to synchronize the list.

Items are marked with a modification timestamp to detect edits. Items are identified by UUIDs to avoid conflicts when inserting new items (as opposed to using auto-incrementing integers). When synchronizing, new UUIDs are detected and copied to the other system; likewise for deletions.

The above data structure is fine for an unordered list, but how can we handle ordering? If we added an integer "rank", that would need renumbering when inserting a new item (thus requiring synchronizing all successor items due to only 1 insertion). Alternatively, we could use fractional ranks (use the average of the ranks of the predecessor and successor item), but that doesn't seem like a robust solution as it will quickly run into accuracy problems when many new items are inserted.

We also considered implementing this as a doubly linked list, with each item holding the UUIDs of its predecessor and successor items. However, that would still require synchronizing 3 items when 1 new item is inserted (or synchronizing the 2 remaining items when 1 item is deleted).

Preferably, we would like to use a data structure or algorithm where only the newly inserted item needs to be synchronized. Does such a data structure exist?

Edit: we need to be able to handle moving an existing item to a different position too!

Margheritamargi answered 12/4, 2012 at 19:58 Comment(4)
If you have {a, b, c} on both systems, and system A inserts p to get {a, b, p, c}, and system B inserts p to get {a, p, b, c}, what order do you want to end up with when you sync?Infatuation
@Geoff, the chances of having two p's are virtually zero, as we are using random UUIDs.Margheritamargi
Sorry, you're right. What I was really wanting to ask was how you handle collisions in the sort order. Before I changed it, I wrote: If you have {a, b, c} on both systems, and system A inserts p to get {a, b, p, c}, and system B inserts q to get {a, b, q, c}, what order for p and q do you want to end up with when you sync?Infatuation
In that case, any order of p and q is acceptable, as long as both systems agree on the same order, obviously.Margheritamargi

There is really no problem with the interpolated rank approach. Just define your own numbering system based on variable-length bit vectors representing binary fractions between 0 and 1 with no trailing zeros. The binary point is to the left of the first digit.

The only inconvenience of this system is that the minimum possible key is 0, given by the empty bit vector. Therefore you use this only if you're positive the associated item will forever be the first list element. Normally, just give the first item the key 1. That's equivalent to 1/2, so random insertions in the range (0..1) will tend to minimize bit usage. To interpolate before and after an item:

01 < newly interpolated = 1/4
1
11 < newly interpolated = 3/4

To interpolate again:

001 < newly interpolated = 1/8
01
011 < newly interpolated = 3/8
1
101 < newly interpolated = 5/8
11 
111  < newly interpolated = 7/8

Note that if you wish you can omit storing the final 1! All keys (except 0, which you won't normally use) end in 1, so storing it is superfluous.

Comparison of binary fractions is a lot like lexical comparison: 0<1 and the first bit difference in a left-to-right scan tells you which is less. If no differences occur, i.e. one vector is a strict prefix of the other, then the shorter one is smaller.
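If keys are stored as strings of '0' and '1' characters, this comparison rule is exactly plain lexicographic string order, which makes a quick sanity check easy (the key values here are just illustrations):

```python
# Keys are binary fractions stored as '0'/'1' strings with no trailing
# zeros. Plain string comparison implements the rule above: the first
# differing bit decides, and a strict prefix sorts first (and is the
# smaller fraction).
keys = ["1", "01", "111", "011", "001", "101", "11"]
assert sorted(keys) == ["001", "01", "011", "1", "101", "11", "111"]
```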

With these rules it's pretty simple to come up with an algorithm that accepts two bit vectors and computes a result that's roughly (or exactly in some cases) between them. Just add the bit strings, and shift right 1, dropping unnecessary trailing bits, i.e. take the average of the two to split the range between.
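A minimal Python sketch of that averaging step, assuming keys are stored as '0'/'1' strings with no trailing zeros (the function name `midpoint` and the shortening loop are my rendering of the optimization described below):

```python
def midpoint(a, b):
    """Return a key strictly between a and b (binary fractions as
    '0'/'1' strings, a < b; '' represents the minimum key 0)."""
    n = max(len(a), len(b))
    # Pad to a common length and add as integers: the sum, read as a
    # fraction with one extra bit, is the average (sum shifted right 1).
    total = int(a.ljust(n, "0") or "0", 2) + int(b.ljust(n, "0"), 2)
    full = format(total, "0{}b".format(n + 1)).rstrip("0")
    # Drop unnecessary trailing bits: the shortest prefix (ending in 1)
    # that still lies strictly between a and b is also a valid key.
    for i in range(1, len(full)):
        cand = full[:i]
        if cand.endswith("1") and a < cand < b:
            return cand
    return full
```

With the deletion example below, `midpoint('01', '111')` first computes `1001` and then shortens it to `1`, matching the hand calculation.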

In the example above, if deletions had left us with:

01
111

and we need to interpolate these, add 01(0) and 111 to obtain 1.001, then shift to get 1001. This works fine as an interpolant. But note the final 1 unnecessarily makes it longer than either of the operands. An easy optimization is to drop the final 1 bit along with trailing zeros to get simply 1. Sure enough, 1 is about halfway between, as we'd hope.

Of course if you do many inserts at the same location (think e.g. of successive inserts at the start of the list), the bit vectors will get long. This is exactly the same phenomenon as inserting at the same point in a binary tree: it grows long and stringy. To fix this, you must "rebalance" during a synchronization by renumbering with the shortest possible bit vectors, e.g. for 7 items you'd use the sequence above.
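That renumbering step can be sketched as follows; for n items it spreads the shortest possible keys evenly across the range (the helper name `rebalance` is mine):

```python
def rebalance(n):
    """Shortest canonical keys (no trailing zeros) for n items,
    returned in list order."""
    L = 1
    while (1 << L) - 1 < n:          # need 2^L - 1 >= n key slots
        L += 1
    # Spread n keys evenly over the multiples of 1/2^L, then drop
    # trailing zeros to get the canonical (shortest) form of each key.
    return [format(((j + 1) << L) // (n + 1), "0{}b".format(L)).rstrip("0")
            for j in range(n)]
```

For example, `rebalance(7)` reproduces the seven-key sequence shown in the interpolation diagram above.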

Addition

Though I haven't tried it, the Postgres bit string type seems to suffice for the keys I've described. The thing I'd need to verify is that the collation order is correct.

Also, the same reasoning works just fine with base-k digits for any k>=2. The first item gets key k/2. There is also a simple optimization that prevents the very common cases of appending and prepending elements at the end and front respectively from causing keys of length O(n). It maintains O(log n) for those cases (though inserting at the same place internally can still produce O(p) keys after p insertions). I'll let you work that out. With k=256, you can use indefinite length byte strings. In SQL, I believe you'd want varbinary(max). SQL provides the correct lexicographic sort order. Implementation of the interpolation ops is easy if you have a BigInteger package similar to Java's. If you like human-readable data, you can convert the byte strings to e.g. hex strings (0-9a-f) and store those. Then normal UTF8 string sort order is correct.
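For the hex-string variant, it is easy to check that hex encoding preserves byte-string order, since each byte maps to a fixed-width pair of digits that sort in the same relative order (example values are mine):

```python
# Hex encoding is order-preserving for byte strings: each byte becomes
# exactly two characters drawn from '0'-'9','a'-'f', which are in
# ascending ASCII order, so lexicographic comparison is unchanged.
keys = [bytes([0x80]), bytes([0x7F, 0xFF]), bytes([0x80, 0x01])]
assert [k.hex() for k in sorted(keys)] == sorted(k.hex() for k in keys)
```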

Bodgie answered 6/8, 2014 at 1:40 Comment(7)
This is a very interesting premise. If you could, as you suggested, flesh out the details more, that would be great. For example, what would the data model look like for our two ordered items (would you be adding a binary bit vector column?), and how/when is the bit vector calculated?Luteous
@RyanNorbauer Note I was somewhat off in my explanation. I've fixed it and added the interpolation algorithm. Yes, you'd add a bit vector column as you say. The DB you are using will have to support the required lexicographic comparison of bit vectors or be extensible to allow it.Bodgie
Can you explain why the binary approach is better than simply using floating points? I'm sure at least that my database will know how to order those. :)Luteous
@RyanNorbauer Because with any finite precision rep you can run out of bits so that insertion can fail. This occurs when you can no longer compute the average of two existing keys (i.e. the answer returned is equal to one of the inputs). If that's okay, you're fine.Bodgie
@RyanNorbauer An example is double keys and repeated insertion at the same point in the list. After about 53 insertions you're out of luck.Bodgie
Ah, OK. Fascinating. So if I'm thinking about this right, your approach seems to offer unbounded precision. My only concern then is sorting in the db. I have never sorted on data like this before, and I'm not sure if SQL Server (the db I plan to use) natively supports something that would give the right result. I could always do it at the app level, but that seems sub-optimal. Your prior comment seemed to suggest that you knew of DBs that would support the required lexicographic comparison of binary data. Any suggestion on how I would go about searching for more info? Thx again!Luteous
@RyanNorbauer See the addition above.Bodgie

You can add two fields to each item - 'creation timestamp' and 'inserted after' (containing the id of the item after which the new item was inserted). Once you synchronize a list, send all the new items. That information is enough for you to be able to construct the list on the other side.

With the list of newly added items received, do this (on the receiving end): sort by creation timestamp, then go one by one, and use the 'inserted after' field to add the new item in the appropriate place.

You may face trouble if an item A is added, then B is added after A, then A is removed. If this can happen, you will need to sync A as well (basically syncing the operations that took place on the list since the last sync, and not just the content of the current list). It's basically a form of log-shipping.
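A rough sketch of the receiving end, assuming each new item carries `id`, `created` (creation timestamp) and `after` (the id it was inserted after, `None` for the head); all of these names are my assumptions:

```python
def apply_new_items(local_ids, new_items):
    """Insert new items into local_ids (a list of item ids), oldest
    creation first, each right after its 'after' anchor."""
    for item in sorted(new_items, key=lambda it: it["created"]):
        if item["after"] is None:
            pos = 0                       # inserted at the head of the list
        else:
            pos = local_ids.index(item["after"]) + 1
        local_ids.insert(pos, item["id"])
    return local_ids
```

Sorting by creation time first ensures that an item inserted after another new item finds its anchor already present in the list.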

Mediatory answered 12/4, 2012 at 20:8 Comment(2)
How would you handle moving an existing item to a different position in the list? Would you (ab)use the creation timestamp or do some magic with the modification timestamp?Margheritamargi
I'm curious to know what solution you settled on, Jason. (In SO, you can answer your own question.)Luteous

You could have a look at "lenses", which is a bidirectional programming concept. For instance, your problem seems to be solved by "matching lenses", described in this paper.

Hoffert answered 22/5, 2012 at 4:38 Comment(1)
I'm very open to this, but all the set notation is gibberish to me. Do you know of any discussions of lenses from an engineering, practical, or industry perspective that is more approachable? Unfortunately, they chose a highly well-worn and ambiguous term for their programming concept, so it's hard to Google.Luteous

I think the data structure that is appropriate here is an order statistic tree. In an order statistic tree you also maintain the size of each subtree along with the other data; the size field makes it easy to find an element by rank, which is exactly what you need here. All operations (rank, delete, change position, insert) are O(log n).
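As a sketch of the idea, here is a small implicit treap, one of the simpler balanced trees that can carry subtree sizes, supporting insert and delete by rank in expected O(log n). All of this code is my illustration, not a production structure:

```python
import random

class Node:
    def __init__(self, val):
        self.val, self.pri = val, random.random()
        self.size, self.left, self.right = 1, None, None

def size(t):
    return t.size if t else 0

def update(t):
    t.size = 1 + size(t.left) + size(t.right)
    return t

def split(t, k):
    """Split t into (tree of first k items, tree of the rest)."""
    if not t:
        return None, None
    if size(t.left) < k:
        l, r = split(t.right, k - size(t.left) - 1)
        t.right = l
        return update(t), r
    l, r = split(t.left, k)
    t.left = r
    return l, update(t)

def merge(a, b):
    """Concatenate two trees, keeping the heap order on priorities."""
    if not a or not b:
        return a or b
    if a.pri > b.pri:
        a.right = merge(a.right, b)
        return update(a)
    b.left = merge(a, b.left)
    return update(b)

def insert_at(t, i, val):
    l, r = split(t, i)
    return merge(merge(l, Node(val)), r)

def delete_at(t, i):
    l, r = split(t, i)
    _, r = split(r, 1)
    return merge(l, r)

def to_list(t):
    return to_list(t.left) + [t.val] + to_list(t.right) if t else []
```

Moving an item to a different position is then a `delete_at` followed by an `insert_at`.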

Vanthe answered 30/7, 2014 at 6:38 Comment(0)

I think you can try a kind of transactional approach here. For example, you do not delete items physically but mark them for deletion, and commit changes only during synchronization. I'm not absolutely sure which data type you should choose; it depends on which operations you want to be most efficient (insertions, deletions, search or iteration).

Suppose we have the following initial state on both systems:

|1|   |2|
---   ---
|A|   |A|
|B|   |B|
|C|   |C|
|D|   |D|

After that the first system marks element B for deletion and the second system inserts element BC between B and C:

|1         |   |2           |
------------   --------------
|A         |   |A           |
|B[deleted]|   |B           |
|C         |   |BC[inserted]|
|D         |   |C           |
               |D           |

Both systems continue processing, taking local changes into account: System 1 ignores element B and System 2 treats element BC as a normal element.

When synchronization occurs:

As I understand it, each system receives the list snapshot from the other system, and both systems freeze processing until synchronization is finished.

So each system iterates sequentially through the received snapshot and the local list, and writes changes to the local list (resolving possible conflicts according to the modification timestamp). After that the 'transaction is committed': all local changes are finally applied and the information about them is erased. For example, for system one:

|1 pre-sync|             |2-SNAPSHOT  |   |1 result|
------------             --------------   ----------
|A         | <the same>  |A           |   |A       |
|B[deleted]| <delete B>  |B           |
             <insert BC> |BC[inserted]|   |BC      |
|C         | <same>      |C           |   |C       |
|D         | <same>      |D           |   |D       |

Systems wake up and continue processing.

Items are sorted by insertion order; moving can be implemented as a simultaneous deletion and insertion. Also, I think it would be possible to transfer not the whole list snapshot but only the list of items that were actually modified.
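For the two-system example above, the commit step can be sketched like this (the representation and names are mine, and it only covers the local-deletion-vs-remote-insertion case shown in the tables):

```python
def commit(local, snapshot):
    """local: list of (id, flag) pairs, flag is None or 'deleted';
    snapshot: the other system's item ids, including its insertions.
    Returns the merged list after applying the local deletions."""
    deleted = {item for item, flag in local if flag == "deleted"}
    return [item for item in snapshot if item not in deleted]
```

For System 1, committing the snapshot `['A', 'B', 'BC', 'C', 'D']` against a local list with B marked deleted yields `['A', 'BC', 'C', 'D']`, matching the result column.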

Ammunition answered 2/8, 2014 at 23:30 Comment(0)

I think, broadly, Operational Transformation could be related to the problem you are describing here. For instance, consider the problem of Real-Time Collaborative text editing.

We essentially have a sorted list of items (words) which needs to be kept synchronized, and in which items can be added/modified/deleted at random. The only major difference I see is in the frequency of modifications to the list (you say it does not happen often).

Operational Transformation happens to be a well-studied field. I could find this blog article giving pointers and an introduction. Plus, for all the problems Google Wave had, they actually made significant advancements to the domain of Operational Transformation. Check this out. There is quite a bit of literature available on this subject. Look at this stackoverflow thread, and at Differential Synchronisation.

Another parallel that struck me was the data structure used in text editors: ropes. So if you have a log of operations, let's say "Index 5 deleted", "Index 6 modified to ABC", "Index 8 inserted", what you might do is transmit that log of changes from System A to System B, and then replay the operations sequentially on the other side.
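Replaying such an operation log on the other side could look like this (the operation encoding is my assumption):

```python
def replay(items, ops):
    """Apply a list of (op, index[, value]) operations in order,
    mutating items in place."""
    for op, *args in ops:
        if op == "delete":
            del items[args[0]]
        elif op == "modify":
            items[args[0]] = args[1]
        elif op == "insert":
            items.insert(args[0], args[1])
    return items
```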

The other "pragmatic engineer's" choice would be to simply reconstruct the entire list on System B when System A changes. Depending on the actual frequency and size of changes, this might not be as bad as it sounds.

Youlandayoulton answered 4/8, 2014 at 10:29 Comment(0)

I have tentatively solved a similar problem by including a PrecedingItemID on each item (which can be null if the item is the top/root of the ordered list), plus a local cache that keeps all items in sorted order. The cache is purely for efficiency, so you don't have to recursively query for or rebuild the list from PrecedingItemIDs every time there is a re-ordering on the local client. When it comes time to sync, I do the slightly more expensive operation of looking for cases where two items claim the same PrecedingItemID. In those cases, I simply order by creation time (or however you want to decide which one wins and comes first), put the second (or others) behind it, and move on ordering the list. I then store this new ordering in the local ordering cache and go on using it until the next sync (just making sure to keep the PrecedingItemID updated as you go).

I haven't unit tested this approach yet, so I'm not 100% sure I'm not missing some problematic conflict scenario, but it appears at least conceptually to handle my needs, which sound similar to those of the OP.
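A sketch of the reconciliation pass I describe, assuming each item is a dict with `id`, `prev` (the PrecedingItemID, `None` for the head) and `created`; all names are illustrative:

```python
from collections import defaultdict

def build_order(items):
    """Rebuild the ordered list; when several items claim the same
    predecessor, the earlier-created one wins the slot and the
    others are placed behind it and its chain."""
    children = defaultdict(list)
    for it in items:
        children[it["prev"]].append(it)
    for kids in children.values():
        kids.sort(key=lambda it: it["created"])  # earlier creation wins

    def walk(prev):
        out = []
        for it in children.get(prev, []):
            out.append(it["id"])
            out.extend(walk(it["id"]))  # followers of this item first
        return out

    return walk(None)
```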

Luteous answered 4/8, 2014 at 15:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.