Curator framework version: 4.3.0, ZooKeeper version: 5.5.0
We have deployed Apache Atlas on Kubernetes, and it uses ZooKeeper to elect one of the two Atlas pods as leader. We run three ZooKeeper pods (a 3-node cluster), so one pod going down should not cause any issue. When one ZooKeeper pod is down, the cluster is still healthy and a ZooKeeper leader is still available; I verified this by exec'ing into a ZooKeeper pod and checking its status. However, the Curator framework throws the following error:
[main:] ~ Background exception was not retry-able or retry gave up (CuratorFrameworkImpl:685)
java.net.UnknownHostException: zookeeper-2.zookeeper-headless.atlas.svc.cluster.local: Name or service not known
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:196)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:101)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:57)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:201)
at org.apache.curator.ConnectionState.start(ConnectionState.java:111)
at org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:214)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:314)
at org.apache.atlas.web.service.CuratorFactory.initializeCuratorFramework(CuratorFactory.java:88)
at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:78)
at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:73)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:89)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1152)
zookeeperConnectionString = "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181"
The problem we are facing is that leaderLatch.start() returns without any error, but the corresponding znode is never created in ZooKeeper.
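
The stack trace shows the UnknownHostException coming from InetAddress.getAllByName, called inside the StaticHostProvider constructor, for the zookeeper-2 host name. To narrow the problem down, it may help to check from inside the Atlas pod which entries of the connection string actually resolve while one ZooKeeper pod is down. Below is a minimal diagnostic sketch using only the JDK. The class name ZkHostCheck and both method names are hypothetical, it assumes plain host:port entries with no chroot suffix, and it does not touch Curator or fix the leader latch; it only reports which hosts the JVM can resolve.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Atlas, Curator, or ZooKeeper.
public class ZkHostCheck {

    // True if the JVM's resolver can resolve the host name
    // (the same InetAddress.getAllByName call that fails in the stack trace).
    static boolean resolves(String host) {
        try {
            InetAddress.getAllByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Keeps only the host:port entries whose host passes the given check.
    // Assumes entries are plain "host:port" with no chroot path.
    static String filterResolvable(String connectString, Predicate<String> hostCheck) {
        List<String> kept = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            String host = colon >= 0 ? trimmed.substring(0, colon) : trimmed;
            if (hostCheck.test(host)) {
                kept.add(trimmed);
            }
        }
        return String.join(",", kept);
    }

    public static void main(String[] args) {
        String zookeeperConnectionString =
            "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,"
          + "zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,"
          + "zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181";
        // Prints the subset of entries that currently resolve from this pod.
        System.out.println(filterResolvable(zookeeperConnectionString, ZkHostCheck::resolves));
    }
}
```

If the output omits zookeeper-2 while that pod is down, that would be consistent with the exception above: the name lookup itself is failing, presumably because the headless-service DNS record is gone while the pod is not running, rather than the ZooKeeper quorum being unavailable.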