Implement master-slave

Asked 22/9, 2012 at 15:39 Answered 23/9, 2012 at 5:21

Running on Ubuntu. The program is in C++. I have two processes running on different hosts, where one is master and one is slave (there is no priority between them, just that only one handles requests). Only one process can be master and handle requests. Both processes are always up, and if they crash there is a watchdog that restarts them.

The hosts are connected by network cable.

My plan is to have each one ask the other for a keep-alive; if the slave stops getting keep-alives from the master, it changes its state to master. When the master starts up again, it first waits for a keep-alive: if it doesn't get one, it sets its role to master; if it does, it sets its role to slave.

I will be happy to get your opinion on:

How do I prevent both from being master at the same time? This is my MAJOR concern. At startup and during a connectivity failure, how do you prevent two masters at the same time?

Do you think it is better to query for keep-alives or to send them? (In my opinion it's better to ask for a keep-alive than to push one.)

Any other good advice and pitfalls are more than welcome.

Pimbley answered 22/9, 2012 at 15:39 Comment(0)

The way I've done this is to have each process spawn a heartbeat thread that sends out a UDP packet once a second, and listens for incoming UDP packets from the other process. If the heartbeat thread doesn't receive any UDP packets from the other process for a specified amount of time (e.g. 5 seconds), it assumes the other process is down and notifies the parent thread that it should become the master now.

The reason the heartbeat sending/listening is done in a dedicated thread is because that way if the main thread is busy doing a lengthy calculation, it won't cause heartbeat UDP packets to temporarily not be sent. That way the algorithms in the main thread don't need to be real-time in order to avoid triggering spurious failovers.
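To make the timeout bookkeeping concrete, here is a minimal sketch of the part of the heartbeat thread that decides whether the peer has gone quiet. The class name `PeerMonitor` and its methods are made up for illustration, and the socket send/receive code is left out; the current time is passed in explicitly so the logic can be exercised without real waiting.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical helper (not from the original answer): remembers when the
// peer's last heartbeat arrived and reports when the silence is too long.
class PeerMonitor {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, immune to wall-clock jumps

    PeerMonitor(Clock::duration timeout, Clock::time_point start)
        : timeout_(timeout), lastSeen_(start) {
        assert(timeout > Clock::duration::zero());
    }

    // Called by the heartbeat thread whenever a packet from the peer arrives.
    void onHeartbeat(Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mutex_);
        lastSeen_ = now;
    }

    // True once no heartbeat has arrived for at least the timeout.
    // Safe to call from the main thread while the heartbeat thread runs.
    bool peerTimedOut(Clock::time_point now) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return now - lastSeen_ >= timeout_;
    }

private:
    const Clock::duration timeout_;
    Clock::time_point lastSeen_;
    mutable std::mutex mutex_;
};
```

In this sketch the heartbeat thread would call `onHeartbeat(Clock::now())` after each successful receive and check `peerTimedOut(Clock::now())` once per send interval, with the timeout set to something like the 5 seconds mentioned above.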

There is another issue to think about here... what happens if a network problem temporarily cuts communication between the two hosts? (e.g. some joker or QA tester unplugs the Ethernet cable for 1 minute, then plugs it back in) In that case, both processes will stop receiving UDP packets from the other process, so both processes will think the other process has gone away, and both will become the master process. Then when the network cable is reconnected, you have two master processes running at once, which is not what you want. So you need some way for two master processes to decide which of the two should demote itself back to slave status, to satisfy the Highlander Principle ("there can be only one!"). This could be as simple as "the host with the smallest IP address should remain master", or you could have each heartbeat packet contain the sending process's uptime, and the host with the larger uptime should remain master, or some similar rule that both sides evaluate the same way.
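A tie-break rule like that can be a single pure function that both hosts run on the same inputs. The sketch below combines the two ideas (larger uptime wins, smaller IP address breaks an exact tie); that combination, the `PeerInfo` struct, and the function name are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of what each side knows about a master process.
struct PeerInfo {
    std::uint32_t ipv4;           // IPv4 address in host byte order
    std::uint64_t uptimeSeconds;  // uptime as carried in the heartbeat packet
};

// Returns true if `self` should stay master when `other` also claims to be
// master. Assumed rule: larger uptime wins; equal uptimes fall back to the
// smaller IP address. Swapping the arguments flips the answer, so exactly
// one side keeps the role.
bool shouldRemainMaster(const PeerInfo& self, const PeerInfo& other) {
    assert(self.ipv4 != other.ipv4);  // two hosts never share an address
    if (self.uptimeSeconds != other.uptimeSeconds)
        return self.uptimeSeconds > other.uptimeSeconds;
    return self.ipv4 < other.ipv4;
}
```

One caveat with this sketch: uptimes sampled at slightly different moments can disagree when they are nearly equal, so a fixed value such as the process start time may be a safer input than a running uptime.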

Britisher answered 22/9, 2012 at 16:12 Comment(4)

I have seen similar ways of doing this. The major problem with the Master/Slave is to determine who should be the Master when there is some kind of an interruption such as LAN connectivity or if one is shut down and restarted. Who should be Master is really important if there is a persistent data storage replicated between the two and you have to determine who has the most up to date data in their copy of the persistent store. – Electric 22/9, 2012 at 16:30

One other approach to this is that if there is a period of missing heartbeats then each will then begin to actively send a query to the other. This will help to diagnose LAN connectivity problems. – Electric 22/9, 2012 at 16:33

This is the way I implemented it (a thread that pushes UDP), but I am still thinking of changing to query instead of push. But... my biggest concern is choosing who is the master. I didn't understand how you handle the case where the two processes start up at the same time, or how you handle a connectivity failure. – Pimbley 22/9, 2012 at 20:45

With two processes starting at the same time, you can have each process listen to incoming packets for a few seconds before deciding whether or not to make itself the master -- that way it has time to gather information about who else is out there, and whether the other processes are better suited to be master, or not. As for solving the problem when there is a connectivity failure -- I don't know a good solution for that (except maybe to have a backup network that it can use if the primary network goes down?) – Britisher 28/9, 2012 at 2:45
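The startup decision described in that last comment could look roughly like the sketch below. It only covers the decision made once the listening window has ended; the names are hypothetical, and "better suited" is assumed here to mean "has the greater unique id", which the comment does not specify.

```cpp
#include <cassert>
#include <cstdint>

enum class Role { Master, Slave };

// Hypothetical contents of a heartbeat heard during the listening window.
struct PeerAnnouncement {
    bool isMaster;     // the peer already holds the master role
    std::uint64_t id;  // unique id of the peer (assumed to differ from ours)
};

// `heard` is nullptr if nothing arrived during the startup window.
Role decideStartupRole(std::uint64_t myId, const PeerAnnouncement* heard) {
    if (heard == nullptr)
        return Role::Master;          // nobody else is out there
    if (heard->isMaster)
        return Role::Slave;           // an established master keeps its role
    assert(heard->id != myId);        // ids must be unique for the tie-break
    // Both sides are still starting up: assumed rule, greater id wins.
    return myId > heard->id ? Role::Master : Role::Slave;
}
```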

The typical way to solve this problem is to hold an election. Everyone in the system shares the data that they'll use as input to the algorithm so that everyone can come to the same conclusion.

For example: the peers all (both) send each other some unique identifier (MAC address, or pid, or high-precision process start time, e.g.). Then each peer uses the same comparison to determine the winner (greatest value, e.g.). Then they inform each other of the results.
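As a rough sketch of that comparison step, assuming each peer has already collected everyone's identifier as a 64-bit number (the function names are invented for this example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Every peer runs the same function on the same set of ids, so every peer
// reaches the same conclusion. Rule used here: greatest value wins.
std::uint64_t electWinner(const std::vector<std::uint64_t>& candidateIds) {
    assert(!candidateIds.empty());
    return *std::max_element(candidateIds.begin(), candidateIds.end());
}

// Convenience wrapper: did this process win against the ids it received?
bool iAmMaster(std::uint64_t myId, const std::vector<std::uint64_t>& peerIds) {
    std::vector<std::uint64_t> all(peerIds);
    all.push_back(myId);
    return electWinner(all) == myId;
}
```

This only works if the identifiers are truly unique and every peer sees the same set; with a partitioned network each side sees only itself and will elect itself.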

For the problem regarding transient connectivity faults, see the Byzantine Generals problem.
