Reconnect a Hazelcast Client

We are connecting to an external Hazelcast cluster (version 3.7.2) using the Java Hazelcast client but are having issues reconnecting if the cluster goes down.

We are creating our client with HazelcastClient.newHazelcastClient. Once we do that, we are keeping a copy of the HazelcastInstance and using that to interact with the Hazelcast cluster (getMap, getSet, etc.). We are also storing the maps, sets, etc. that we get from the HazelcastInstance in potentially long lived objects. Everything works fine in the happy path. However, if the cluster ever goes down and comes back up, we get HazelcastInstanceNotActiveException when trying to access these objects that were created prior to the cluster going down.
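To illustrate the pattern (a simplified sketch, not our actual code; the class and map names are made up):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class SessionCache {
    // Created once at startup and kept for the life of the application.
    private final HazelcastInstance client = HazelcastClient.newHazelcastClient();
    // Proxy objects like this map are also stored in long-lived fields.
    private final IMap<String, String> sessions = client.getMap("sessions");

    public String lookup(String key) {
        // Works fine until the cluster goes down; after that, this throws
        // HazelcastInstanceNotActiveException even once the cluster is back.
        return sessions.get(key);
    }
}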

Is there a way to automatically re-establish the client connection when the cluster comes back online so we can resume using the objects (maps, sets, etc.) we retrieved from Hazelcast before the cluster went down? Or do we need additional code to catch HazelcastInstanceNotActiveException and then rebuild the HazelcastInstance and any objects we have stored in the client application? The latter seems quite invasive and is definitely not something we want to deal with in every place we store one of these Hazelcast objects.

Most of what I've read refers to the client network settings for connection timeout, connection attempt limit, and connection attempt period. We are currently using the default values, but they do not seem to have any effect when accessing an object we've already retrieved. Any access to a previously retrieved object fails immediately with HazelcastInstanceNotActiveException, even after the cluster is back up.
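For reference, this is roughly how those settings would be applied on the Java client (a sketch; the address and values are illustrative, not recommendations):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

ClientConfig config = new ClientConfig();
config.getNetworkConfig()
        .addAddress("10.0.0.1:5701")       // illustrative cluster address
        .setConnectionTimeout(5000)        // ms to wait for a single connect
        .setConnectionAttemptLimit(2)      // how many times to retry the cluster
        .setConnectionAttemptPeriod(3000); // ms between connection attempts
HazelcastInstance client = HazelcastClient.newHazelcastClient(config);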

This seems like a common problem many people would run into. What is the best practice for dealing with this?

Immoderation answered 18/1, 2017 at 6:56 Comment(0)

As you've already read, setting the connection attempt limit to Integer.MAX_VALUE and increasing the period between attempts is the direction you're heading in.

At the moment there's no other way to solve this. I could imagine a minimal SPI for plugging in custom reconnection strategies, such as exponential back-off, but no such thing exists yet.
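Something along these lines (a sketch; pick a period that suits your environment):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

ClientConfig config = new ClientConfig();
config.getNetworkConfig()
        .setConnectionAttemptLimit(Integer.MAX_VALUE) // effectively retry forever
        .setConnectionAttemptPeriod(10000);           // wait 10 s between attempts
HazelcastInstance client = HazelcastClient.newHazelcastClient(config);

Bear in mind that with this setting, calls made while the cluster is down will block while the client keeps retrying.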

Entangle answered 18/1, 2017 at 9:12 Comment(3)
If I increase the connection attempts to Integer.MAX_VALUE, it will block indefinitely while the cluster is down. We have a fairly high traffic load, so we can't queue up all of our requests until the cluster is back up. The HazelcastInstance knows whether it's running (getLifecycleService().isRunning()), so it would be nice if it failed immediately when it's not running but kept trying to reconnect in the background based on some predefined strategy (retry every X seconds, exponential back-off, etc.). Is anything like this possible?Immoderation
Please create a feature request on GitHub; at the moment the methods are designed to block, so what you're asking for makes perfect sense but is a massive change from the current approach. You might want to look at the async ops and kill them after a timeout?Entangle
Thanks for your suggestions. I submitted a feature request here: github.com/hazelcast/hazelcast/issues/9692Immoderation

Hazelcast 3.11 introduced an exponential-backoff client connection retry strategy: https://docs.hazelcast.org/docs/latest/manual/html-single/#configuring-client-connection-retry.

<hazelcast-client>
  ...
  <connection-strategy async-start="false" reconnect-mode="ON">
    <connection-retry enabled="true">
      <initial-backoff-millis>1000</initial-backoff-millis>
      <max-backoff-millis>60000</max-backoff-millis>
      <multiplier>2</multiplier>
      <fail-on-max-backoff>true</fail-on-max-backoff>
      <jitter>0.5</jitter>
    </connection-retry>
  </connection-strategy>
  ...
</hazelcast-client>
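The same thing can be set programmatically on the Java client; this is a sketch against the 3.11 API as I understand it (the setters mirror the XML elements above):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientConnectionStrategyConfig;
import com.hazelcast.client.config.ConnectionRetryConfig;
import com.hazelcast.core.HazelcastInstance;

ClientConfig clientConfig = new ClientConfig();

// Retry with exponential back-off: 1 s initial delay, doubling up to 60 s, with jitter.
ConnectionRetryConfig retryConfig = new ConnectionRetryConfig();
retryConfig.setEnabled(true);
retryConfig.setInitialBackoffMillis(1000);
retryConfig.setMaxBackoffMillis(60000);
retryConfig.setMultiplier(2);
retryConfig.setFailOnMaxBackoff(true);
retryConfig.setJitter(0.5);

// Reconnect automatically whenever the connection to the cluster is lost.
ClientConnectionStrategyConfig strategyConfig = clientConfig.getConnectionStrategyConfig();
strategyConfig.setAsyncStart(false);
strategyConfig.setReconnectMode(ClientConnectionStrategyConfig.ReconnectMode.ON);
strategyConfig.setConnectionRetryConfig(retryConfig);

HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);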
Parenthood answered 20/2, 2019 at 8:18 Comment(0)