Can the WAIT command provide strong consistency in Redis?

Greetings overflowers,

In a Redis Sentinel/Cluster setup, can we use the WAIT command with the total number of slaves to ensure strong consistency across the Redis servers? If not, why not?

Kind regards

Recorder answered 10/11, 2015 at 11:51 Comment(0)

WAIT implements synchronous replication for Redis. Synchronous replication is required but not sufficient to achieve strong consistency. Strong consistency is, in practice, the sum of two things:

  1. Synchronous replication to the majority of nodes on a distributed system.
  2. A way to orchestrate change of leadership (failover basically) so that it is guaranteed that only a node preserving the full history of acknowledged operations in the previous leader can be elected.

WAIT does not provide "2". The failover procedure in Redis is performed by Sentinel or Redis Cluster, and neither is able to provide property 2 (since synchronous replication in Redis is the exception, not the rule, there was not much focus on that aspect). However, what the Redis failover does do is attempt to promote the slave that appears to preserve the greatest amount of data. While this does not change the theoretical guarantees of Redis failover, which can still lose acknowledged writes, it means that if you use WAIT, more slaves hold a given operation in their memory, and in turn it is a lot more likely that in the event of a failover the operation will be retained. However, while this makes a failure mode that discards the acknowledged operation hard to trigger, a failure mode with these properties always exists.

TLDR: WAIT does not make Redis linearizable. What it does is make sure the specified number of slaves receive the write, which in turn makes failover more robust, but without any hard guarantee.
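To make the semantics concrete, here is a minimal sketch of how a client typically issues WAIT after a write, using the redis-py library; the host, port, replica count, and timeout are illustrative assumptions, not values from the question.

```python
import redis

# Connect to the current master (host/port are assumptions for this sketch).
r = redis.Redis(host="localhost", port=6379)

# Perform a write on the master.
r.set("some:key", "some-value")

# Block until at least 2 replicas have acknowledged the write, or until
# 1000 ms have passed. WAIT returns the number of replicas that actually
# acknowledged, which may be lower than requested if the timeout expires.
acked = r.execute_command("WAIT", 2, 1000)

if acked < 2:
    # The write reached fewer replicas than requested; the caller decides
    # how to react (retry, roll back, report an error, ...).
    print(f"only {acked} replicas acknowledged the write")
```

Note that the return value only tells you how many replicas acknowledged the write; it says nothing about which node a later failover will promote, which is exactly why this does not add up to linearizability.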

Middelburg answered 10/11, 2015 at 12:58 Comment(19)
Thank you antirez for your answer. I will definitely accept it. However, I have this scenario in mind and I would like to hear your thoughts about it. I am trying to implement distributed locks using the WAIT operation, such that if WAIT fails with any error, the lock will not be acquired, and instead the lock keys will be deleted immediately or will automatically expire eventually. Other clients will also simply fail to acquire the failed locks until they are deleted/expired. Assuming that I have handled other issues such as timing, is this a safe approach?Recorder
If you want distributed locks with hard guarantees, there is a much better way to do this! The Redlock algorithm is an implementation using N different masters, with synchronous replication implemented client-side, and it is available for several languages. I hope this helps!Middelburg
Thanks antirez. I am currently using Redlock. However, I would still love to understand whether my suggestion is safe. Currently, our solution does not require a multi-master setup (except when using Redlock), and we would like to explore using WAIT with a single-master, multi-slave setup, as it is simpler to manage.Recorder
I don't think your algorithm is safe. For example, what happens after a failover? If you pick a slave without a given lock, even if the lock itself reached N/2+1 other nodes, the SETNX in the (new) master will succeed and WAIT will also succeed, since the other instances are now replicating from the new master. If you don't fail over, then you can just use a single instance :-) since you have a non-available system during failures anyway. Btw, handling a multi-master system is IMHO simpler, so I would pick Redlock regardless.Middelburg
Thank you for your time and effort in answering my questions. After a failover, all WAIT operations will fail with an error, causing clients to also fail to acquire the locks. Hence, new clients can now safely acquire the same locks regardless of the newly selected master. Am I wrong? :)Recorder
The problem is with the locks already provided to other clients. Example: <old master> gives out a lock and writes to N/2+1 servers. FAILOVER HAPPENS HERE. <new master> was not among those N/2+1 servers; now all the slaves replicate from it, so the old lock no longer exists inside Redis. However, a client is still using it, and <new master> will provide the same lock to another client: safety violation, game over.Middelburg
I really appreciate your patience with me :) But the WAIT command can be given a parameter that requires all slaves to be synced and to acknowledge the sync before the WAIT is considered successful. This way we can guarantee that all slaves have the same view of the locks. Am I missing something here?Recorder
Yep, as I said, you are not considering that after a failover you may elect a slave which did not receive the last lock (or any other past lock still valid from the POV of the client that received it).Middelburg
But that is impossible, since the client is WAITing for all slaves (using the number-of-slaves parameter that I mentioned before) to register the lock before the client claims it; otherwise the client considers the lock as not acquired. Could you please elaborate on how this can happen in this specific approach?Recorder
Three servers A, B, C. A is the current master. Some client acquires a lock L1 by writing to A and WAITing with replication factor 2. The write is received by A and B, since C was disconnected at that moment, or there was a lag in the replication channel, or whatever. Then A fails, and the failover mechanism elects C as the new master since, due to a temporary network partition, B was not available (and A is still failing). Now the new master is C, which has no notion of the lock L1; B and A become slaves of C, losing the L1 key as well, so L1 no longer exists. A new client asks for L1 and acquires it again. FAILMiddelburg
I edited the above comment multiple times, please reload & re-read.Middelburg
Here is my understanding of this scenario. The client attempts to acquire L1 by writing to the master A and WAITing for the write to be replicated to both slaves B and C. Since C is disconnected at this moment, WAIT will return 1 as the number of slaves that acknowledged the write (namely B). At this point the client will see that not all slaves have acknowledged the write, hence it will fail to acquire the lock. Any acknowledged write can now be rolled back by the client via immediate deletion, or automatically by the servers via eventual expiry. Both are safe.Recorder
In general, any failing server (master or slave) will cause the locking to fail because the number of acknowledged writes will be less than 100%, which is required for strong consistency among all servers. I hope that I am not missing a very obvious thing here :)Recorder
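For concreteness, here is a minimal sketch of the scheme described in the two comments above (acquire with SET NX, WAIT for every slave, roll back on partial acknowledgement), written with redis-py; the slave count, TTL, timeout, and key/token names are illustrative assumptions. As the following comments point out, this variant gives up availability and is not a substitute for Redlock.

```python
import redis

TOTAL_SLAVES = 2          # assumed number of slaves in the setup
LOCK_TTL_MS = 30_000      # assumed lock expiry
WAIT_TIMEOUT_MS = 1_000   # assumed WAIT timeout

r = redis.Redis(host="localhost", port=6379)  # assumed master address

def try_acquire(lock_key: str, token: str) -> bool:
    # SET NX PX: create the lock only if it does not already exist, with a TTL.
    if not r.set(lock_key, token, nx=True, px=LOCK_TTL_MS):
        return False
    # Require every slave to acknowledge the write.
    acked = r.execute_command("WAIT", TOTAL_SLAVES, WAIT_TIMEOUT_MS)
    if acked < TOTAL_SLAVES:
        # Partial replication: treat the lock as not acquired and roll back.
        # (The key would also expire on its own via the TTL.)
        r.delete(lock_key)
        return False
    return True
```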
Yep, you missed something (this violates availability), but this discussion is too long at this point :-) I suggest reading a few introductions to distributed systems like this one (not perfect at all, but a decent start): book.mixu.net/distsys/single-page.html, the Raft paper, and in general how consistency is achieved in a replicated state machine. This will give you some perspective and tools to easily spot the problems your scheme has, and why agreement is needed to solve this problem in the general case (if you want to use replication).Middelburg
Redlock is explicitly a way to make the problem a lot simpler compared to replicated state machines, exploiting the fact that we need only specific guarantees in order for the distributed lock to have certain safety and availability properties. But in order to control the synchronous replication needed for this use case, it is implemented client-side instead of relying on Redis replication, which is designed in a different way.Middelburg
Hi @antirez, what if I wait for the entire cluster (1 master and 2 slaves) to get the writes, and only then acknowledge the client? Will that make it strongly consistent? (Given that I'm fine with the cluster not accepting writes if even one of the machines goes down.)Duchess
Yes @ptntialunrlsd, by totally sacrificing availability of any kind during failures, you are indeed safe. So your replicas will not allow you to survive failures, but they will make writes safer, because you can run the different nodes in different geographical locations. All this makes sense if you run AOF in fsync=always mode; otherwise, during restarts, each replica may lose writes, making the scheme weak.Middelburg
Thanks @antirez! Yes, my master will be writing AOF with fsync=always, and RDB will be disabled. The 2 slaves of this master won't be writing any logs. Will this be fully safe (given the master's hard disk never crashes), or should I also write RDB on one of the slaves? (I don't feel the need, though.)Duchess
@antirez, consider merging your last comment into your answer, since it is pretty crucial. That is, the following two conditions guarantee strong consistency (at the expense of availability): 1. WAIT for every single replica to get writes; 2. Every single replica uses AOF persistence with fsync=always.Lewis
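Putting the last few comments together, the setup being described is: every write must be acknowledged by every replica via WAIT, and every node runs AOF with fsync=always. Below is a brief sketch of the write path under those assumptions; the node count, address, timeout, and error handling are made up for illustration.

```python
import redis

# Assumed redis.conf settings on every node (master and both slaves):
#   appendonly yes
#   appendfsync always

TOTAL_REPLICAS = 2        # 1 master + 2 slaves, as in the scenario above
WAIT_TIMEOUT_MS = 1_000   # illustrative timeout

r = redis.Redis(host="localhost", port=6379)  # assumed master address

def write_or_refuse(key: str, value: str) -> None:
    r.set(key, value)
    acked = r.execute_command("WAIT", TOTAL_REPLICAS, WAIT_TIMEOUT_MS)
    if acked < TOTAL_REPLICAS:
        # With any node unreachable, refuse the write instead of acknowledging
        # it to the client: safety is kept at the expense of availability.
        raise RuntimeError(f"write reached only {acked} of {TOTAL_REPLICAS} replicas")
```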
