Configure Hazelcast CPSubsystem Retries Timeout

Currently I have three instances registered in the CPSubsystem.

      ----- 
     | I1* | * Leader
      ----- 

 ----       ---- 
| I2 |     | I3 |
 ----       ---- 

When all instances are up and running, registered in the CP Subsystem, and seeing each other, everything works as expected. The following call is used to acquire distributed locks across the instances:

getHazelcastInstance().getCPSubsystem().getLock(lockDefinition.getLockEntryName())

I noticed an issue when two of these instances die, leaving no leader and no other instances available to perform a leader election:

      ----- 
     | XXX | * DEAD
      ----- 

 ----       ----- 
| I2 |     | XXX | * DEAD
 ----       ----- 

The remaining instance then tries to acquire a distributed lock, but the call freezes inside the getLock method, causing requests to queue up for minutes (hence the need to configure a timeout for the case where the instance becomes the only one left in the subsystem).

I have also noticed the following log entries being printed indefinitely:

2019-08-16 10:56:21.697  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:23.737  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 106
2019-08-16 10:56:23.927  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:26.006  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1468, invocationTime=1565963786004 (2019-08-16 10:56:26.004), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.core.MemberLeftException: Member [127.0.0.1]:5702 - ab45ea09-c8c9-4f03-b3db-42b7b440d161 this has left cluster!
2019-08-16 10:56:26.232  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 107
2019-08-16 10:56:26.413  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:27.143  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1479, invocationTime=1565963787142 (2019-08-16 10:56:27.142), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5703, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, partitionId: 81, operation: com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp, service: hz:core:raft
2019-08-16 10:56:28.835  WARN 1337 --- [cached.thread-6] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 108
2019-08-16 10:56:28.941  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.038  WARN 1337 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 109
2019-08-16 10:56:31.533  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.555  WARN 1337 --- [.async.thread-3] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1493, invocationTime=1565963791554 (2019-08-16 10:56:31.554), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=150, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.cp.exception.NotLeaderException: CPMember{uuid=ab45ea09-c8c9-4f03-b3db-42b7b440d161, address=[127.0.0.1]:5702} is not LEADER of CPGroupId{name='default', seed=0, commitIndex=6}. Known leader is: N/A

Is there a way to identify that the instance is now running alone and, if so, avoid blocking the application while acquiring a new lock?

I was looking for some mechanism that does not block the flow of the application in any way; even if the application is running alone, I would rather fall back to a regular j.u.c.l.ReentrantLock instead of the FencedLock.
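As an illustration of that idea only (this is a sketch under my own assumptions, not Hazelcast API: the acquireDistributedOrLocalLock wrapper, the single-thread probe executor and the 5-second wait are all made up), the acquisition could be attempted on a separate thread with a bounded wait and degrade to a local lock when the CP Subsystem does not answer in time:

// Sketch only. Assumes imports from java.util.concurrent, java.util.concurrent.locks,
// com.hazelcast.core.HazelcastInstance and com.hazelcast.cp.lock.FencedLock.
private final ExecutorService cpProbe = Executors.newSingleThreadExecutor();
private final ReentrantLock localFallback = new ReentrantLock();

public Lock acquireDistributedOrLocalLock(HazelcastInstance hz, String lockName) {
    Future<FencedLock> pending = cpProbe.submit(
            () -> hz.getCPSubsystem().getLock(lockName)); // the call that may hang without a leader
    try {
        FencedLock fenced = pending.get(5, TimeUnit.SECONDS); // assumed bounded wait
        fenced.lock();
        return fenced;
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
        pending.cancel(true);   // stop waiting on the CP Subsystem
        localFallback.lock();   // degrade to a lock that only protects this instance
        return localFallback;
    }
}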

Payee answered 16/8, 2019 at 14:19 Comment(0)

After a few days of testing, I came to the following conclusion:

  1. Although the CP Subsystem demands at least three members to start working, it is fine to keep running with only two instances up
  2. In the most catastrophic scenario I presented (having just one instance running), there is not much to do; your environment is probably having a rough time, and some kind of intervention or attention will be needed to resolve the interruption

I decided to prevent the request from being fulfilled when this scenario happens, in order to keep the consistency of all operations between the instances.

This decision was made after reading a lot of material (here, here, here, here, here) and also after simulating the scenario over here.

So the approach was the following:

try {
    // Fail fast when the CP Subsystem cannot provide a consistent distributed lock
    if (!hz.isCpInstanceAvailable()) {
        throw new HazelcastUnavailableException("CPSubsystem is not available");
    }
    ... acquires the lock ...
} catch (HazelcastUnavailableException e) {
    LOG.error("Error retrieving Hazelcast Distributed Lock :( Please check the CPSubsystem health among all instances", e);
    throw e;
}

The isCpInstanceAvailable method performs three validations:

  1. If the current application is registered on the CPSubsystem
  2. If the CPSubsystem is up
  3. If there is a minimum number of members available in the CP Subsystem

So here is the solution:

protected boolean isCpInstanceAvailable() {
    try {
        // The local member must be registered in the CP Subsystem and more than one
        // CP member must be reachable within the configured timeout.
        return getCPLocalMember() != null
                && getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS).size() > ONE_MEMBER;
    } catch (InterruptedException | ExecutionException | TimeoutException e) {
        LOG.error("Issue retrieving CP Members", e);
    }

    return false;
}

protected ICompletableFuture<Collection<CPMember>> getCPMembers() {
    // Asynchronously fetches the current CP members, failing fast if the service returns null
    return Optional.ofNullable(getCPSubsystemManagementService().getCPMembers()).orElseThrow(
            () -> new HazelcastUnavailableException("CP Members not available"));
}

protected CPMember getCPLocalMember() {
    // Returns null when this instance is not registered as a CP member
    return getCPSubsystemManagementService().getLocalCPMember();
}

Here comes the catch: simply calling getCPMembers().get() would cause the long pause I was experiencing, because it waits for the default invocation timeout.

So I used getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS) instead, which throws a TimeoutException if the call exceeds the expected timeout.
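For completeness, getMemberValidationTimeout is not shown above; a hypothetical implementation (the property name and the 5-second default are assumptions, not part of the original answer) could read it from a system property so it stays configurable per environment:

// Hypothetical helper: validation timeout in seconds, overridable via
// -Dcp.member.validation.timeout.seconds=<n>, defaulting to a small value so the
// availability check fails fast instead of waiting for the full invocation timeout.
protected long getMemberValidationTimeout() {
    return Long.getLong("cp.member.validation.timeout.seconds", 5L);
}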

Payee answered 21/8, 2019 at 17:55 Comment(0)

The CP Subsystem is designed to block all operations on the data structures that belong to the CP Subsystem family when there are not sufficient members available to form the CP Subsystem in the first place. This property is controlled by CPSubsystemConfig.setCPMemberCount(int).
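For example, assuming programmatic configuration (the value 3 simply matches the question's three-instance setup):

// Sketch: enable a 3-member CP Subsystem (Hazelcast 3.12 programmatic config).
Config config = new Config();
config.getCPSubsystemConfig().setCPMemberCount(3);
HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance(config);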

hazelcastInstance.getCPSubsystem().getCPSubsystemManagementService().getCPMembers() will give you the CP members in the cluster.

To determine the cluster member count, you can use hazelcastInstance.getCluster().getMembers() and/or register a MembershipListener to be notified of member-joined and member-left events, as sketched below.
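A minimal sketch of the listener approach (Hazelcast 3.12 API; the LOG reference and what you do inside the callbacks are assumptions):

// Sketch: react to membership changes so the application knows when the cluster
// has shrunk below the number of members the CP Subsystem needs.
hazelcastInstance.getCluster().addMembershipListener(new MembershipListener() {
    @Override
    public void memberAdded(MembershipEvent event) {
        LOG.info("Member joined: {}", event.getMember());
    }

    @Override
    public void memberRemoved(MembershipEvent event) {
        LOG.warn("Member left: {}, members now: {}", event.getMember(), event.getMembers().size());
    }

    @Override
    public void memberAttributeChanged(MemberAttributeEvent event) {
        // attribute changes are not relevant for availability checks
    }
});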

Domenic answered 21/8, 2019 at 5:50 Comment(1)
Hi there, I have solved the issue with some more handling on the getMembers call; the issue was exactly there. I will answer the question explaining the approach. – Payee
