How to handle Apache Curator Distributed Lock loss of connection

Say I have two distributed process running the following code that uses zookeeper and curator for a shared lock:

public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(500, 2));
    client.start();
    InterProcessMutex lock = new InterProcessMutex(client, "/12345");

    System.out.println("before acquire");
    lock.acquire();
    System.out.println("lock has been acquired");
    //do some things that need to be done in an atomic fashion
    lock.release();
    System.out.println("after release");
}

Where the "do some things" comment represents multiple statements that need to be done by only one process at a time. e.g. multiple writes into various databases.

This all seems fine until one of the java process loses the connection with zookeeper after it has acquired a lock.

According to the documentation:

It is strongly recommended that you add a ConnectionStateListener and watch for SUSPENDED and LOST state changes. If a SUSPENDED state is reported you cannot be certain that you still hold the lock unless you subsequently receive a RECONNECTED state. If a LOST state is reported it is certain that you no longer hold the lock.

If I understand this correctly, at any point after acquiring a lock I might receive a notification that the lock has been lost, due to a network issue, at which point some other process may have acquired the lock. If that is true there is no guarantee that after acquiring the lock you are the only process that has the lock. My precious statements that must only be executed by one process at a time might be interleaved with another process.

Have I misunderstood the above? If so please clarify what it means. If I have not misunderstood the above how is a curator lock useful if it cannot guarantee exclusive access?

It's a general rule of distributed systems: the network and other instances are unstable. If your instance loses contact with the ZooKeeper ensemble there is no way that you can be sure of the state of your lock node. This is what it means to get a SUSPENDED connection state change. Internally, ZooKeeper has notified Curator that the connection to its ZooKeeper instance has been lost.

This said, you can safely assume that no other instance will get the lock before your session times out so what you do is somewhat up to you. Further note that the meaning of the LOST connection state has changed in Curator 3.x. Prior to Curator 3.x the LOST state only meant that your retry policy had expired. In 3.x, Curator now sets an internal timer when the connection is SUSPENDED and the LOST connection state means that the session has expired. So, for many applications, you can safely ignore SUSPENDED and only exit your lock when LOST is received.

All of this aside. Even using a JDK lock in a single JVM you must be able to handle having your thread interrupted. Having your Curator application locks handle SUSPENDED/LOST is the same thing semantically.

Hope this helps (note I'm the main author of Apache Curator)

Recommended topics

Hot tags