MongoDB ReplicaSet - PRIMARY role falls to SECONDARY when only PRIMARY is left

I am investigating using MongoDB ReplicaSet for high availability.

But I just discovered that in a replica set with 3 nodes, if the PRIMARY mongod is the only one left (that is, the 2 other mongod instances died or were shut down), then after several seconds it switches its role to SECONDARY and no longer accepts writes. That makes the replica set worth less than a single instance.

I know and understand about PRIMARY election, but if the PRIMARY role is fixed to one server (by setting its priority to, say, 10) and the other servers become inaccessible (for example due to network problems), why does the main server just give up?!

Tested with 2.4.8 on Windows (mongodb-win32-x86_64-2008plus-2.4.8) and Linux (CentOS), and with 2.0.x on Linux.

BOUNTY STARTED:

If the replica set gives up when the PRIMARY feels alone, what are the alternatives to ensure 100% availability? Or maybe a special configuration is needed for this case. The current implementation makes the replica set fragile in case of network problems.

UPDATED:

Alas, I did not describe before the scenario where #3 goes down first (PRIMARY and SECONDARY are left) and then, after a while, the SECONDARY goes down too. Then the PRIMARY really does just "give up", because it has already known for some time that #3 is unavailable. This was actually tested in my test environment.

var rsconfig = {
  "_id": "rs4",
  "members": [
    { "_id": 0, "host": "localhost:27041", "priority": 10 },
    { "_id": 1, "host": "localhost:27042" },
    { "_id": 2, "host": "localhost:27043", "arbiterOnly": true }
  ]
}
printjson(rsconfig)
rs.initiate(rsconfig)
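
After initiating, the member states can be checked from the mongo shell. A minimal sketch (rs.status() is the standard call; the loop is just for readability):

// print each member's name and current state (PRIMARY / SECONDARY / ARBITER)
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr)
})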

We initially thought to put the SECONDARY and #3 (that is, the ARBITER) on the same server, but because of the issue in the title, we cannot use such a configuration.

Thanks to Alan Spencer for first explaining the logic that MongoDB takes.

Amadis answered 15/11, 2013 at 12:25 Comment(3)
The bigger problem would be if, in search of 100% availability, you end up accepting writes on both sides of the partition - no matter what the system, something will have to resolve those conflicting writes. Just because you accept writes 100% of the time doesn't mean they will be in the system after the network partition heals. I highly recommend the "call me maybe" series for understanding why this is hard: aphyr.com/posts/…Gaynor
By the way, I think you misunderstood the "priority" feature - it does not fix a role to a node, it simply influences elections when all else is equal.Gaynor
I did understand. Maybe "fix" is not the best word, but while the node with the highest priority is up, it is PRIMARY.Amadis

This is expected: since the majority of the members are down, MongoDB does not assume the last remaining member is consistent.

When you have a majority of the members down, there are a couple of options: http://docs.mongodb.org/manual/tutorial/reconfigure-replica-set-with-unavailable-members/
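
For illustration, the forced-reconfiguration route from that tutorial looks roughly like this when run in a mongo shell on the surviving member (the member index below is illustrative):

cfg = rs.conf()
cfg.members = [cfg.members[0]]   // keep only the member(s) still reachable
rs.reconfig(cfg, {force: true})  // force is required because no majority exists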

Autopilot answered 15/11, 2013 at 12:29 Comment(14)
The imagined and simulated situation is when only the PRIMARY is left (e.g. the others are unavailable due to network problems). The PRIMARY is the only mongod instance accepting writes, so it always has the latest data. The result of the current MongoDB implementation is that within 5-10 seconds the replica set ceases to operate.Amadis
@PaulVerest when only the primary is left of a 3-member replica set, the majority is down... I am not following your problemAutopilot
A majority is needed for election. I see no good reason for the PRIMARY to give up and cease to accept writes. At least I have seen no documentation that says so.Amadis
@PaulVerest if the majority does not exist and the health heartbeat of the other members returns failure for a specific amount of time, an election will be initiated, at which time the primary will become a secondary. I am sure this is in the docs somewhereAutopilot
@PaulVerest it is also good to understand that the odd glitch in your network will go unnoticed; the primary won't consider other members down on the first failed ping. It is when there is a prolonged problem and an actual failure in your network that you will get this. You could try putting your members in different data centers, that should do the trick, but then you might get network partition problemsAutopilot
@PaulVerest - there is a good reason for the PRIMARY to give up being PRIMARY... As it can't see a majority, it doesn't know whether the others are all still up, can see each other, and have elected a new PRIMARY - hence the network partition problem mentioned by SammayeTinder
@AlanSpencer That was the logic behind it (to prevent 2 PRIMARYs appearing when there is network partitioning). It should be the answer. I still wonder if that can be overridden.Amadis
I don't understand "It should be the answer". The design depends on how you want to deal with certain failure conditions and what infrastructure you have. For example, you may want to have one SECONDARY and an ARBITER, with at least the ARBITER in a different availability zone.Tinder
@Autopilot this is not why this happens. It happens because there is no way for the primary to know whether the other two nodes are down or not.Gaynor
@PaulVerest see my answer - a majority is needed for election, but a majority is also needed for a primary to remain primary. I.e. if a node couldn't be elected, then it cannot stay primary.Gaynor
@AsyaKamsky huh? I believe I said that "the health heartbeat of the other members returns failure for a specific amount of time, an election will be initiated"Autopilot
That's not accurate. An election is initiated when it's detected that there is no primary.Gaynor
@AsyaKamsky but there was a primary...so didn't the primary step down once it realised that the majority was down?Autopilot
@AsyaKamsky Of course in my last comment I meant through an election failingAutopilot

You say that when the primary is cut off from the other two nodes it should stay up, otherwise write availability is lost, but that's not necessarily the case. If the other two nodes are actually up and on the other side of the network partition, then they have elected a new primary (as two out of three are a majority) and it is that primary that is accepting new writes.

If the previous primary continued to accept writes, you would have potentially conflicting data with no mechanism to resolve it. Since a MongoDB replica set is a single-primary architecture (as opposed to a multi-master system), the election mechanism assures that there cannot be two primaries at the same time.

From the point of view of the two secondaries, a network partition is the same as the primary being unavailable, and from the primary's point of view, a network partition is indistinguishable from "both other nodes are down". The primary steps down because, in the case of a network partition, there may already be another primary on the other side, and stepping down assures there cannot be two primaries.

It is not the case that the "replica set" gives up when the primary feels alone - the reason the primary steps down when it feels alone is precisely to preserve the integrity of the replica set as a whole. It is also not true that setting a high priority score fixes a role to a node - a primary can only be elected via consensus among a majority; all priority scores do is influence the election when all other things are equal.
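
As a small sketch of that last point, assuming a shell connected to the current primary: priority is only an election weight in the configuration, and raising it cannot keep an isolated node primary.

// raising a member's priority weights future elections in its favour,
// but the member still steps down whenever it cannot see a majority
cfg = rs.conf()
cfg.members[0].priority = 10
rs.reconfig(cfg)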

I highly recommend the excellent "call me maybe" series as reading to understand the challenges of write availability in a distributed system: http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions

Gaynor answered 24/11, 2013 at 0:45 Comment(6)
Forgive me if I am wrong, but he isn't dealing with a network partition here. Also, isn't your answer about consistency?Autopilot
How would any node in a replica set distinguish a network partition from the scenario the OP is dealing with?Gaynor
I thought I remembered reading that MongoDB has an internal method of knowing when there is a partition in a replica config?Autopilot
My point is that it is IMPOSSIBLE to differentiate between a network partition and the other nodes being down. There is no difference from each node's point of view!Gaynor
Oh OK, it just seems that through simple IP/location knowledge from the replica config it could be very easy to detect, and I just remember reading somewhere that MongoDB did, but if it doesn't that's okAutopilot
It's not possible to know. If you ping a host and it's unreachable - how do you know if it's up or down?Gaynor

Just to chime in on the answers: the behavior in this scenario is expected. MongoDB uses a leader election algorithm to elect a new leader, so if there is no majority you cannot elect a leader, and hence there are no writes.

Your only option at the point where 2 nodes are down is to reconfigure your replica set as a 1-node replica set to make it writeable. You can do this using the rs.reconfig command with just one server. However, please note that this should only be a temporary, emergency configuration. For the longer term you should have an odd number of nodes (3 or more) in your replica set configuration.
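
As a rough sketch of what follows once the set has been shrunk to the one reachable server with a forced rs.reconfig (the collection name below is illustrative):

db.isMaster().ismaster       // true once the lone member has promoted itself
db.test.insert({ ping: 1 })  // a trial write should now succeed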

Follansbee answered 20/11, 2013 at 21:6 Comment(1)
It uses a quorum: en.wikipedia.org/wiki/Quorum - not sure what a "leader election algorithm" is, if it even existsAutopilot

Try using arbiters. Most documents say to use just one, but in your case, you need to be able to win the election.

From http://docs.mongodb.org/manual/core/replica-set-architectures/ :

Fault tolerance for a replica set is the number of members that can become unavailable and still leave enough members in the set to elect a primary. In other words, it is the difference between the number of members in the set and the majority needed to elect a primary. Without a primary, a replica set cannot accept write operations. Fault tolerance is an effect of replica set size, but the relationship is not direct.

More on elections: http://docs.mongodb.org/manual/core/replica-set-elections/

More on arbiters: http://docs.mongodb.org/manual/faq/replica-sets/#how-many-arbiters-do-replica-sets-need
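
For what it's worth, adding an arbiter from a shell connected to the current primary is a one-liner (the host below is hypothetical):

rs.addArb("localhost:27044")  // an arbiter votes in elections but stores no data
rs.conf().members             // the new member should show arbiterOnly: true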

Indopacific answered 24/11, 2013 at 20:41 Comment(2)
If the majority goes down, i.e. 2/3, I doubt a single or even two arbiters will help much; this is a problem more substantial than simply using arbitersAutopilot
In fact, to add to this, I would consider this bad advice, since to set up his replica set as you suggest he would need a substantial number of arbiters (if it were to even work in his scenario, which it won't, because in every scenario he describes the majority is down), creating a lot of dud members in the setAutopilot
