How to handle Apache Curator Distributed Lock loss of connection
Asked Answered
M

1

12

Say I have two distributed process running the following code that uses zookeeper and curator for a shared lock:

public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(500, 2));
    client.start();
    InterProcessMutex lock = new InterProcessMutex(client, "/12345");

    System.out.println("before acquire");
    lock.acquire();
    System.out.println("lock has been acquired");
    //do some things that need to be done in an atomic fashion
    lock.release();
    System.out.println("after release");
}

Where the "do some things" comment represents multiple statements that need to be done by only one process at a time. e.g. multiple writes into various databases.

This all seems fine until one of the java process loses the connection with zookeeper after it has acquired a lock.

According to the documentation:

It is strongly recommended that you add a ConnectionStateListener and watch for SUSPENDED and LOST state changes. If a SUSPENDED state is reported you cannot be certain that you still hold the lock unless you subsequently receive a RECONNECTED state. If a LOST state is reported it is certain that you no longer hold the lock.

If I understand this correctly, at any point after acquiring a lock I might receive a notification that the lock has been lost, due to a network issue, at which point some other process may have acquired the lock. If that is true there is no guarantee that after acquiring the lock you are the only process that has the lock. My precious statements that must only be executed by one process at a time might be interleaved with another process.

Have I misunderstood the above? If so please clarify what it means. If I have not misunderstood the above how is a curator lock useful if it cannot guarantee exclusive access?

Magniloquent answered 8/12, 2016 at 15:18 Comment(0)
E
13

It's a general rule of distributed systems: the network and other instances are unstable. If your instance loses contact with the ZooKeeper ensemble there is no way that you can be sure of the state of your lock node. This is what it means to get a SUSPENDED connection state change. Internally, ZooKeeper has notified Curator that the connection to its ZooKeeper instance has been lost.

This said, you can safely assume that no other instance will get the lock before your session times out so what you do is somewhat up to you. Further note that the meaning of the LOST connection state has changed in Curator 3.x. Prior to Curator 3.x the LOST state only meant that your retry policy had expired. In 3.x, Curator now sets an internal timer when the connection is SUSPENDED and the LOST connection state means that the session has expired. So, for many applications, you can safely ignore SUSPENDED and only exit your lock when LOST is received.

All of this aside. Even using a JDK lock in a single JVM you must be able to handle having your thread interrupted. Having your Curator application locks handle SUSPENDED/LOST is the same thing semantically.

Hope this helps (note I'm the main author of Apache Curator)

Entice answered 8/12, 2016 at 23:0 Comment(10)
In particular the lock does guarantee exclusive access while the connection is good and up to a session timeout's worth of time after a connection failure. And you can know there's a connection failure by noticing a SUSPENDED connection state change. Is that right?Lemur
I think the comparison with having your thread interrupted is not a fair comparison. An InterruptedException is only thrown from an interruptible blocking method, you don't get such an exception after you have acquired the lock mid way through your code. And even if the Thread.isInterrupted() status does get set after you acquire the lock, that doesn't mean someone else might acquire your lock. Just to be clear, I understand that you are stating that at the moment you receive a SUSPENDED/LOST notification you enter a race condition with your session time out, is that correct?Magniloquent
@Magniloquent I stand by my comparison to a JDK lock. Any time you use a JDK lock you have to have some mechanism to interrupt the thread that owns the lock. Otherwise you risk a dead lock. -- "at the moment you receive a SUSPENDED/LOST notification you enter a race condition with your session time out". It's not a race. ZooKeeper manages the lock. If you lost your connection to ZK you no longer have anything managing the lock. You are in an indeterminate state until the connection is repaired.Entice
@Randgalt, what is the best way to handle when the connection to zookeeper is lost while the exclusive operation is in progress? appreciate any insights into this.Shirleeshirleen
@Shirleeshirleen you should interrupt your operation and back out if possible. But, it depends a lot on what you're doing. So, it's not possible to give a blanket answer.Entice
@Entice yeah, I agree. the use case is this. we have a job that writes some information into a database as part of its execution. we need to make sure, no two instances of the same job are running at the same time. these are adhoc jobs triggered by users. the requests could land at different executors. so to ensure no two instances of the same job are running, we can take a distributed lock. the concern I've is, when the lock renewal is not possible, there will be a time lag before which we get to know. what happens when some other executor starts another instance. how to deal with it?Shirleeshirleen
Locks are live for as long as the ZooKeeper client session. So, it depends on the length of your session.Entice
@Entice wouldn't be helptful to add a parameter to the LeaderLatch so that it wouldn't report a notLeader event to the leader node until the connection is Lost? We find that transition between leaders are problematic and we wan't to avoid leadership changes if possible.Avi
@Avi that would be OK with me. Please submit a PREntice
@Entice I see that reconnect creates a new notLeader event as Zookeeper Curator seems to be electing the leader. (seems like the expected behaviour) May be what I need is a different event model in with notLeader event separated in different events like isFollower, isUnknown and isDisconnected...Avi

© 2022 - 2024 — McMap. All rights reserved.