Synchronize Data across multiple occasionally-connected clients using Event Sourcing (NodeJS, MongoDB, JSON)

I'm facing a problem implementing data synchronization between a server and multiple clients. I read about Event Sourcing and I would like to use it to accomplish the syncing part.

I know that this is not a technical question, but more of a conceptual one.

I would just send all events live to the server, but the clients are designed to be used offline from time to time.

This is the basic concept: [diagram: Visual Concept]

The server stores all events that every client should know about. It does not replay those events to serve the data, because the main purpose is to sync the events between the clients, enabling them to replay all events locally.

Each client has its own JSON store, also keeping all events and rebuilding all the different collections from the stored/synced events.

As clients can modify data offline, it is not that important to have consistent syncing cycles. With this in mind, the server should handle conflicts when merging the different events and ask the specific user in the case of a conflict.

So, the main problem for me is to determine the diffs between the client and the server, to avoid sending all events to the server. I'm also having trouble with the order of the synchronization process: push changes first, or pull changes first?

What I've currently built is a default MongoDB implementation on the server side, which isolates all documents of a specific user group in all my queries (currently it only handles authentication and server-side database work). On the client, I've built a wrapper around a NeDB store that lets me intercept all query operations to create and manage events per query, while keeping the default query behaviour intact. I've also compensated for the different ID systems of NeDB and MongoDB by implementing custom IDs that are generated by the clients and stored as part of the document data, so that recreating a database won't mess up the IDs (when syncing, these IDs should be consistent across all clients).
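
For illustration, a minimal sketch of what such a wrapper could look like, assuming NeDB's standard callback API; the SyncedStore name, the gid field and the separate events datastore are made up for this sketch, not the actual implementation:

// Hypothetical sketch of a NeDB wrapper that records one event per write.
const Datastore = require('nedb');
const crypto = require('crypto');

class SyncedStore {
  constructor(collection, filename, creator) {
    this.collection = collection;
    this.creator = creator;
    this.db = new Datastore({ filename, autoload: true });
    this.events = new Datastore({ filename: filename + '.events', autoload: true });
  }

  insert(doc, cb) {
    // Attach a client-generated global ID so the document keeps the same
    // identity on every client, independent of NeDB's internal _id.
    doc.gid = doc.gid || crypto.randomUUID(); // Node 14.17+
    this.db.insert(doc, (err, newDoc) => {
      if (err) return cb(err);
      this.recordEvent('create', newDoc.gid, doc, err2 => cb(err2, newDoc));
    });
  }

  update(gid, changes, cb) {
    this.db.update({ gid }, { $set: changes }, {}, (err, numAffected) => {
      if (err) return cb(err);
      this.recordEvent('update', gid, changes, err2 => cb(err2, numAffected));
    });
  }

  remove(gid, cb) {
    this.db.remove({ gid }, {}, (err, numRemoved) => {
      if (err) return cb(err);
      this.recordEvent('remove', gid, {}, err2 => cb(err2, numRemoved));
    });
  }

  recordEvent(type, target, data, cb) {
    this.events.insert({ type, collection: this.collection, target, data,
      timestamp: Date.now(), creator: this.creator }, cb);
  }
}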

The event format will look something like this:

{
   type: 'create/update/remove',
   collection: 'CollectionIdentifier',
   target: ID,    // the global custom ID of the document affected
   data: {},      // the inserted/updated data
   timestamp: '', // when the change happened
   creator: ''    // some way to identify the author of the change
}
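
A concrete instance (with made-up values) could then look like this:

{
   type: 'update',
   collection: 'todos',
   target: '0f8e2c54-7d1a-4b6e-9c3d-2a1f5e8b9c01',
   data: { done: true },
   timestamp: 1488277020000,
   creator: 'client-42'
}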

To save some memory on the clients, I will create snapshots after certain amounts of events, so that rebuilding the collections doesn't require fully replaying all events from the beginning.
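
A rough sketch of how that snapshotting could work; the threshold, the applyEvent reducer and the snapshot shape are assumptions:

// Fold events into a state snapshot every N events, so a rebuild can start
// from the snapshot instead of replaying from event zero.
const SNAPSHOT_EVERY = 500; // illustrative threshold

function applyEvent(state, event) {
  switch (event.type) {
    case 'create':
    case 'update':
      state[event.target] = Object.assign({}, state[event.target], event.data);
      break;
    case 'remove':
      delete state[event.target];
      break;
  }
  return state;
}

function maybeSnapshot(events, snapshot) {
  // snapshot = { state, eventCount }, with events[0..eventCount) already folded in
  const fresh = events.slice(snapshot.eventCount);
  if (fresh.length < SNAPSHOT_EVERY) return snapshot;
  return {
    state: fresh.reduce(applyEvent, Object.assign({}, snapshot.state)),
    eventCount: events.length
  };
}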

So, to narrow down the problem: I'm able to replay events on the client side, and I'm able to create and maintain the events on both the client and the server side. Merging the events on the server side should also not be a problem. Replicating a whole database with existing tools is not an option either, as I'm only syncing certain parts of the database (not even entire collections, as the documents are assigned to different groups in which they should sync).

But what I am having trouble with is:

  • The process of determining what events to send from the client when syncing (Avoid sending duplicate events, or even all events)
  • Determining what events to send back to the client (Avoid sending duplicate events, or even all events)
  • The right order of syncing the events (Push/Pull changes)

Another question I would like to ask is whether storing the updates directly on the documents, in a revision-like style, would be more efficient?

If my question is unclear or a duplicate (I found some questions, but they didn't help in my scenario), or if something is missing, please leave a comment. I will maintain it as best as I can to keep it simple, as I've just written down everything that could help you understand the concept.

Thanks in advance!

Twotime answered 28/2, 2017 at 10:37 Comment(0)

This is a very complex subject, but I'll attempt some form of answer.

My first reflex upon seeing your diagram is to think of how distributed databases replicate data between themselves and recover in the event that one node goes down. This is most often accomplished via gossiping.

Gossip rounds make sure that data stays in sync. Time-stamped revisions are kept on both ends and merged on demand, say when a node reconnects, or simply at a given interval (publishing bulk updates via socket or the like).

Database engines like Cassandra or Scylla use three messages per merge round: SYN, ACK and ACK2.

Demonstration:

Data in Node A

{ id: 1, timestamp: 10, data: { foo: '84' } }
{ id: 2, timestamp: 12, data: { foo: '23' } }
{ id: 3, timestamp: 12, data: { foo: '22' } }

Data in Node B

{ id: 1, timestamp: 11, data: { foo: '50' } }
{ id: 2, timestamp: 11, data: { foo: '31' } }
{ id: 3, timestamp: 8, data: { foo: '32' } }

Step 1: SYN

Node A lists the ids and last upsert timestamps of all its documents (feel free to change the structure of these data packets; here I'm using verbose JSON to better illustrate the process).

Node A -> Node B

[ { id: 1, timestamp: 10 }, { id: 2, timestamp: 12 }, { id: 3, timestamp: 12 } ]

Step 2: ACK

Upon receiving this packet, Node B compares the received timestamps with its own. For each document: if its own copy is older, it just places the id and timestamp in the ACK payload; if its copy is newer, it places it along with its data; and if the timestamps are the same, it does nothing, obviously.

Node B -> Node A

[ { id: 1, timestamp: 11, data: { foo: '50' } }, { id: 2, timestamp: 11 }, { id: 3, timestamp: 8 } ]

Step 3: ACK2

Node A updates its documents where ACK data was provided, then sends back the latest data to Node B for those where no ACK data was provided.

Node A -> Node B

[ { id: 2, timestamp: 12, data: { foo: '23' } }, { id: 3, timestamp: 12, data: { foo: '22' } } ]

That way, both nodes now have the latest data merged both ways (in case the client did offline work), without having to send all your documents.
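
To make the three steps concrete, here is a rough JavaScript sketch over in-memory Maps; the function names, the Map shape, and the handling of ids one side has never seen are assumptions for illustration:

// One SYN/ACK/ACK2 merge round; docsA and docsB are Maps of
// id -> { timestamp, data }, standing in for the two nodes' stores.

// Step 1 (Node A): send ids and last upsert timestamps only.
function syn(docsA) {
  return [...docsA].map(([id, doc]) => ({ id, timestamp: doc.timestamp }));
}

// Step 2 (Node B): include data if B's copy is newer, only the (older)
// timestamp if A's copy is newer, and nothing if the timestamps match.
function ack(docsB, synPacket) {
  const reply = [];
  for (const { id, timestamp } of synPacket) {
    const mine = docsB.get(id);
    if (!mine || mine.timestamp < timestamp) {
      // B is behind (or missing the doc): ask A for its newer version
      reply.push({ id, timestamp: mine ? mine.timestamp : 0 });
    } else if (mine.timestamp > timestamp) {
      // B is ahead: ship the data
      reply.push({ id, timestamp: mine.timestamp, data: mine.data });
    }
  }
  return reply;
}

// Step 3 (Node A): apply the data B sent, and answer with A's data
// for the ids where B only echoed an older timestamp.
function ack2(docsA, ackPacket) {
  const reply = [];
  for (const entry of ackPacket) {
    if (entry.data) {
      docsA.set(entry.id, { timestamp: entry.timestamp, data: entry.data });
    } else {
      const mine = docsA.get(entry.id);
      reply.push({ id: entry.id, timestamp: mine.timestamp, data: mine.data });
    }
  }
  return reply; // Node B applies these entries the same way
}

Running this against the Node A / Node B data above reproduces the three packets shown in the steps.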

In your case, your source of truth is your server, but you could easily implement peer-to-peer gossiping between your clients with WebRTC, for example.

Hope this helps in some way.

Cassandra training video

Scylla explanation

Redmund answered 28/2, 2017 at 17:32 Comment(10)
Thanks for the answer! This is a very interesting approach! It's easy to implement, and since my server controls the merging, I can resolve conflicts by some default rules instead of asking the user (if that causes headaches I can still implement some client-side prompts). So in my case the server would be Node A and would send all IDs and timestamps together? When the collections get bigger, that might be a lot of data (even if it's just the ID and timestamp), but I might find an efficient solution for that too.Twotime
I also like the idea that I could implement this between the different clients as well; this would enable faster syncing for clients on the same network.Twotime
This would also save some space by avoiding duplicate data, as I don't need to store all data + all events on the client side.Twotime
You can create the business rules for what gets transferred based on what makes sense for you. You could look at the most recent timestamp stored in your client store and send it as a pre-SYN step so that you just get the most recent changes. Another option would be to tag gossip rounds by topic, and just sync the topics you need at a given time. You decide.Redmund
Yes, that's what I need, awesome! Then I will combine that with the topics so that I can sync the different stores independently and filter out data that should not be synced.Twotime
That also enables me to store all data from different user groups in one collection on the server; that's pretty neat!Twotime
Glad I could help, I'd be interested to know how this turns out :)Redmund
I've tested different things, especially with event sourcing, but that caused a lot of bugs and was very inconsistent. Thanks for your detailed answer, this really helps me solve my problem!Twotime
As soon as it's implemented, I'll update with my final concept of approaching this in my use case, as this was very hard to research :)Twotime
@JoschuaSchneider have you realized the implementation based on this approach? It's the most obvious path to follow I think, but not as easy as it seems. Making a first skeleton of the exchange flow between 2 parties is easy, but there are a lot of situations to account for, let alone concurrent read/write and trigger management...Nitrometer

I think that the best solution to avoid all the event-order and duplication issues is to use the pull method. This way every client maintains its last imported event state (with a tracker, for example) and asks the server for the events generated after that last one.
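
A sketch of that pull, assuming the server stamps every stored event with a monotonically increasing sequence number; the /events endpoint, the seq field and the applyEvent callback are made-up names (fetch is the Node 18+ global):

// The client persists the sequence number of the last imported event and
// only asks the server for events generated after it.
let lastImportedSeq = 0; // would live in the client's persistent store

async function pullNewEvents(applyEvent) {
  const res = await fetch(`https://server.example/events?after=${lastImportedSeq}`);
  const events = await res.json(); // assumed ordered by seq, ascending
  for (const event of events) {
    applyEvent(event);           // replay into the local collections
    lastImportedSeq = event.seq; // advance the tracker
  }
}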

An interesting problem will be detecting the breaking of business invariants. For that you could also store the log of applied commands on the client, and in case of a conflict (events were generated by other clients in the meantime) you could retry the execution of commands from the command log. You need to do that because some commands will not succeed after re-execution; for example, a client saves a document after another user deleted that document at the same time.
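
And a sketch of replaying the command log after a pull; the command shape and the conflict callback are illustrative:

// After pulling remote events, re-execute the locally logged commands against
// the merged state; commands whose invariants no longer hold (e.g. the target
// document was deleted remotely) surface as conflicts instead of failing silently.
function replayCommandLog(commandLog, store, onConflict) {
  for (const command of commandLog) {
    try {
      command.execute(store); // assumed to re-check business invariants
    } catch (err) {
      onConflict(command, err); // e.g. prompt the user who issued the command
    }
  }
}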

Olivette answered 28/2, 2017 at 16:59 Comment(2)
Thanks for the answer! So, when each client gets the latest state from the server, when or how would the clients push their changes to the server? I would prefer to resolve conflicts entirely on the client (or by the specific client whose changes resulted in a conflict), so in that case the client-side command log is a nice detail.Twotime
Just like in Git: you pull, resolve conflicts, then push.Olivette
