CuratorFrameworkImpl - Background exception was not retry-able or retry gave up

Curator framework version - 4.3.0, Zookeeper version - 5.5.0

We have deployed Apache Atlas on Kubernetes, and it uses ZooKeeper to elect one of the two Atlas pods as the leader. We are running three ZooKeeper pods (a 3-node cluster), so one pod going down should not cause any issue. When one ZooKeeper pod is down, the ZooKeeper cluster is still healthy and a ZooKeeper leader is still available; I verified this by exec'ing into a ZooKeeper pod and checking its status. But the Curator framework throws the following error:

[main:] ~ Background exception was not retry-able or retry gave up (CuratorFrameworkImpl:685)
java.net.UnknownHostException: zookeeper-2.zookeeper-headless.atlas.svc.cluster.local: Name or service not known
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
    at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:196)
    at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:101)
    at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:57)
    at org.apache.curator.ConnectionState.reset(ConnectionState.java:201)
    at org.apache.curator.ConnectionState.start(ConnectionState.java:111)
    at org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:214)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:314)
    at org.apache.atlas.web.service.CuratorFactory.initializeCuratorFramework(CuratorFactory.java:88)
    at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:78)
    at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:73)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:89)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1152)

zookeeperConnectionString = "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181"

The problem we are facing is that when we call leaderLatch.start(), it does not return any error, but the corresponding znode is not created in ZooKeeper.
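For reference, this is roughly how we start the latch and how the latch znode can be checked afterwards. This is a minimal sketch; the latch path /leader-election, the participant id, and the timeout values are placeholders, not our exact Atlas configuration:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.concurrent.TimeUnit;

public class LeaderLatchSketch {
    public static void main(String[] args) throws Exception {
        String zookeeperConnectionString =
                "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,"
              + "zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,"
              + "zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181";

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zookeeperConnectionString, 60_000, 15_000,
                new ExponentialBackoffRetry(1_000, 3));
        client.start();
        client.blockUntilConnected();   // wait until a ZooKeeper session is actually established

        // placeholder latch path and participant id
        LeaderLatch leaderLatch = new LeaderLatch(client, "/leader-election", "atlas-pod-0");
        leaderLatch.start();            // asynchronous: returns immediately and reports no error here

        // leadership (and the ephemeral latch znode) only appears once the latch wins the election
        boolean isLeader = leaderLatch.await(30, TimeUnit.SECONDS);
        System.out.println("has leadership: " + isLeader);

        // the latch creates ephemeral sequential children under the latch path
        System.out.println("latch children: "
                + client.getChildren().forPath("/leader-election"));
    }
}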

Claudell answered 17/1, 2022 at 12:37

The reason you see that error is that on Kubernetes, when a pod is restarted, its DNS record is also removed for a short time until the pod comes back up. In your case this should not be an issue, because Curator will connect to another ZooKeeper server from your connection string.
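A minimal sketch of that idea: give Curator the full connection string and a retry policy generous enough to ride out the short window while one pod's DNS record is missing. The session/connection timeouts and retry counts below are illustrative assumptions, not values taken from the Atlas setup:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorClientSketch {
    public static void main(String[] args) throws Exception {
        // all three headless-service hostnames; while one pod restarts and its
        // DNS record is briefly missing, Curator can still reach the other two
        String connectString =
                "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,"
              + "zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,"
              + "zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181";

        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString(connectString)
                .sessionTimeoutMs(60_000)
                .connectionTimeoutMs(15_000)
                // keep retrying instead of giving up quickly
                .retryPolicy(new ExponentialBackoffRetry(1_000, 10))
                .build();
        client.start();
        client.blockUntilConnected();
        System.out.println("connected: " + client.getZookeeperClient().isConnected());
    }
}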

Perfervid answered 14/9, 2022 at 16:19

Reviewing/cleaning up the DNS records did not help me with the same issue.

What did help me, and what I'd recommend, is this: https://github.com/apache/shardingsphere/issues/19079

Gaffrigged answered 12/3, 2023 at 7:11
