Zookeeper error: Cannot open channel to X at election address

I have installed zookeeper on 3 different AWS servers. The following is the configuration on all the servers:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=x.x.x.x:2888:3888
server.2=x.x.x.x:2888:3888
server.3=x.x.x.x:2888:3888

All three instances have a myid file at /var/zookeeper with the appropriate id in it. All three servers have all ports open in the AWS console. But when I run the zookeeper server, I get the following error on all the instances:

2015-06-19 12:09:22,989 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@382] 
  - Cannot open channel to 2 at election address /x.x.x.x:3888
java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:579)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
  at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
  at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2015-06-19 12:09:23,170 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@382]
   - Cannot open channel to 3 at election address /x.x.x.x:3888
java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:579)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
  at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
  at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2015-06-19 12:09:23,170 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 25600
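When every node logs Connection refused for its peers, either nothing is listening yet on the peer's quorum/election ports or a firewall/security group is in the way. A minimal sketch of a reachability check from one node, using bash's /dev/tcp feature (the peer IPs below are placeholders for your other two servers):

```shell
# Placeholder peer IPs: replace with the other servers' addresses.
for host in 10.0.0.2 10.0.0.3; do
  for port in 2888 3888; do
    # /dev/tcp/<host>/<port> is a bash feature: the redirect succeeds
    # only if a TCP connection to that host:port can be opened.
    if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
done
```

A "Connection refused" (as opposed to a hang/timeout) means the packet reached the host but nothing was listening on the port, which points at ZooKeeper not yet running on the peer rather than a security-group block.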
Burgos answered 19/6, 2015 at 14:42 Comment(4)
Did you make sure the zookeeper server started fine on all three nodes? – Epicureanism
@NitinArora Yes, I started them on all 3 nodes. The first server threw an error that it cannot connect to 2 and 3, the second that it cannot connect to 1 and 3, and the third respectively. – Burgos
That's a WARN, not an error. It seems the 1st node tries to talk to the other nodes; I think it's normal! – Porte
I fixed this issue by using fully qualified host names in /etc/zookeeper/conf_example/zoo.cfg instead of IP addresses, and then allowed all traffic on ports 2888 and 3888 using ufw. – Zita

How have you defined the IP of the local server on each node? If you have given the public IP, then the listener will fail to bind to the port. You must specify 0.0.0.0 for the current node:

server.1=0.0.0.0:2888:3888
server.2=192.168.10.10:2888:3888
server.3=192.168.2.1:2888:3888

The same change must be performed on the other nodes too, each replacing its own entry with 0.0.0.0.
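One way to avoid hand-editing three diverging files is to keep a single template listing the real IPs of all nodes and rewrite only the local node's line at deploy time. A minimal sketch; the MYID value and the template/output file names are placeholders for your setup:

```shell
# This node's id, i.e. the contents of its myid file (placeholder).
MYID=1

# Rewrite only this node's own server.N address to 0.0.0.0, leaving the
# peers' entries (and the :2888:3888 ports) untouched.
sed "s/^server\.$MYID=[^:]*/server.$MYID=0.0.0.0/" zoo.cfg.template > zoo.cfg
```

Run with MYID set appropriately on each node, and all three generated files stay consistent with the shared template.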

Oenomel answered 23/6, 2015 at 3:2 Comment(11)
Hello, I am facing the same issue, but this solution didn't work. Please help. Below is my config... dataDir=/tmp/zookeeper clientPort=3888 maxClientCnxns=0 initLimit=5 syncLimit=2 server.1=x.x.x.x:2888:3888 server.2=0.0.0.0:2888:3888 – Maintain
You have the maximum number of client connections set to 0: maxClientCnxns=0 – Bysshe
Thank you. A shame this isn't mentioned in the official getting started guide. – Gogetter
Could you flesh out this answer some more? 0.0.0.0 makes zookeeper listen on all IPs. What if I only want it to listen on one? – Shakitashako
Why... the heck... isn't this part of the official documentation? Us poor saps have to find this ourselves. – Dannie
@ady's comment on Shades88's issue is incorrect: setting maxClientCnxns to 0 means there is no limit on concurrent client connections. This setting is there to prevent DoS attacks. – Athos
In my case the iptables service was running; after stopping the service, zookeeper started in server mode without error. No need for 0.0.0.0... – Grisgris
Is there any official documentation which verifies this? – Superclass
@Oenomel may you PLEASE shed some light on this issue? Why does it happen? Why is it not automatically handled by Ambari? You have basically saved me. I have already done 5 cluster installations and this solved it! :) – Cyrenaic
PLEASE DON'T USE 0.0.0.0 with zookeeper versions between 3.5.0 and 3.5.8. It kills zookeeper nodes after the first restart. There is a confirmed bug in Zookeeper which was fixed only since 3.8.0; however, confluent-platform, for instance, is currently using 3.7.0. Commit with description. @Oenomel I think it is worth noting this in the answer. – Marlin
@AntonSmolkov is right. Do not change the Zookeeper config; just look in /etc/hosts and make sure your host does not resolve to 127.0.0.1. For more info check https://mcmap.net/q/327411/-zookeeper-error-cannot-open-channel-to-x-at-election-address – I think it's the best answer here. – Insufferable

I ran into the same problem and solved it.

Make sure the myid file matches your configuration in zoo.cfg.

Check the zoo.cfg file in your conf directory, which contains content such as:

server.1=zookeeper1:2888:3888  
server.2=zookeeper2:2888:3888  
server.3=zookeeper3:2888:3888  

Then check the myid file in your server's dataDir directory. For example:

Let's say the dataDir defined in zoo.cfg is '/home/admin/data'.

Then on zookeeper1 you must have a file named myid containing the value 1; on zookeeper2, a file named myid containing 2; and on zookeeper3, a file named myid containing 3.

If it is not configured like this, the server will listen on a wrong ip:port.
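A quick sanity check along these lines: read the node's myid and confirm zoo.cfg has a matching server.N entry. The paths below follow the example above and will differ on your install:

```shell
DATADIR=/home/admin/data      # the dataDir from zoo.cfg (example value)
CFG=/home/admin/conf/zoo.cfg  # adjust to your conf directory (placeholder)

MYID=$(cat "$DATADIR/myid")
# Every node's myid must correspond to exactly one server.N line.
if grep -q "^server\.$MYID=" "$CFG"; then
  echo "myid $MYID matches a server.$MYID line in zoo.cfg"
else
  echo "no server.$MYID line found for myid $MYID" >&2
fi
```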

Represent answered 16/2, 2017 at 7:10 Comment(1)
This solution worked for me. Add a unique id for each Zookeeper server in the <dataDir>/myid file: zookeeper1: <dataDir>/myid = 1, zookeeper2: <dataDir>/myid = 2, zookeeper3: <dataDir>/myid = 3. – Octavie

Here is an Ansible Jinja2 template for automating the build of a cluster with the 0.0.0.0 hostname in zoo.cfg:

{% for url in zookeeper_hosts_list %}
  {%- set url_host = url.split(':')[0] -%}
  {%- if url_host == ansible_fqdn or url_host in ansible_all_ipv4_addresses -%}
server.{{loop.index0}}=0.0.0.0:2888:3888
{% else %}
server.{{loop.index0}}={{url_host}}:2888:3888
{% endif %}
{% endfor %}
Misprision answered 4/10, 2016 at 18:28 Comment(1)
Thanks! I added an if url_host == inventory_hostname check as well, for machines accessed through NAT or similar (i.e. the IPs gathered by the setup module might not be the same as the external IP). – Procto

If your own hostname resolves to 127.0.0.1 (in my case, the hostname was in /etc/hosts), zookeeper won't start up without 0.0.0.0 in the zoo.cfg file; but if your hostname resolves to the actual machine's IP, you can put its own hostname in the config file.
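A sketch of that check, assuming a Linux host with getent available: look up what your own hostname resolves to and warn if it is a loopback address.

```shell
name=$(hostname)
# First address the resolver returns for this hostname.
addr=$(getent hosts "$name" | awk '{print $1; exit}')
case "$addr" in
  127.*|::1) echo "WARNING: $name resolves to loopback ($addr); fix /etc/hosts or use 0.0.0.0 in zoo.cfg" ;;
  "")        echo "WARNING: $name does not resolve at all" ;;
  *)         echo "$name resolves to $addr; using the hostname in zoo.cfg should work" ;;
esac
```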

Pavid answered 22/11, 2016 at 20:20 Comment(2)
Same here, I had something like 127.0.1.1 zookeeper1 in my /etc/hosts. After removing it, Zookeeper started successfully. – Castrate
I think this is the best answer here. So there is no need to replace any IP addresses with 0.0.0.0. – Insufferable

This is what worked for me

Step 1:
Node 1:
zoo.cfg
server.1=0.0.0.0:<port>:<port2>
server.2=<IP>:<port>:<port2>
.
.
.
server.n=<IP>:<port>:<port2>

Node 2 :
server.1=<IP>:<port>:<port2>
server.2=0.0.0.0:<port>:<port2>
.
.
.
server.n=<IP>:<port>:<port2>


Now, in the location defined by dataDir in your zoo.cfg:

Node 1:
echo 1 > <dataDir>/myid

Node 2:
echo 2 > <dataDir>/myid

.
.
.

Node n:
echo n > <dataDir>/myid

This helped me start zookeeper successfully, but I will know more once I start playing with it. Hope this helps.

Oralle answered 16/12, 2015 at 18:49 Comment(0)

I had similar issues on a 3-node zookeeper ensemble. The solution was as advised by espeirasbora, plus a restart.

So this was what I did on zookeeper1, zookeeper2 and zookeeper3:

A. Issue: znodes in my ensemble could not start

B. System setup: 3 znodes on 3 machines

C. Error:

In my zookeeper log file I could see the following errors:

2016-06-26 14:10:17,484 [myid:1] - WARN  [SyncThread:1:FileTxnLog@334] - fsync-ing the write ahead log in SyncThread:1 took 1340ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2016-06-26 14:10:17,847 [myid:1] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@810] - Connection broken for id 2, my id = 1, error = 
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2016-06-26 14:10:17,848 [myid:1] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2016-06-26 14:10:17,849 [myid:1] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
2016-06-26 14:10:17,851 [myid:1] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@736] - Send worker leaving thread
2016-06-26 14:10:17,852 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-06-26 14:10:17,854 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower

D. Actions & Resolution:

On each znode:
a. I modified the configuration file $ZOOKEEPER_HOME/conf/zoo.cfg to set the machine's own IP to "0.0.0.0" while keeping the IP addresses of the other 2 znodes.
b. Restarted the znode.
c. Checked the status.
d. Voilà, I was OK.

See below

-------------------------------------------------

on Zookeeper1

#Before modification 
[zookeeper1]$ tail -3   $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

#After  modification 
[zookeeper1]$ tail -3  $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=0.0.0.0:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

#Start the Zookeeper (stop and start, or restart)
[zookeeper1]$ $ZOOKEEPER_HOME/bin/zkServer.sh  start
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower

[zookeeper1]$ $ZOOKEEPER_HOME/bin/zkServer.sh  status
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower

---------------------------------------------------------

on Zookeeper2

#Before modification 
[zookeeper2]$ tail -3   $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

#After  modification 
[zookeeper2]$ tail -3  $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=zookeeper1:2888:3888
server.2=0.0.0.0:2888:3888
server.3=zookeeper3:2888:3888

#Start the Zookeeper (stop and start, or restart)
[zookeeper2]$ $ZOOKEEPER_HOME/bin/zkServer.sh  start
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower

[zookeeper2]$ $ZOOKEEPER_HOME/bin/zkServer.sh  status
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower

---------------------------------------------------------

on Zookeeper3

#Before modification 
[zookeeper3]$ tail -3   $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888

#After  modification 
[zookeeper3]$ tail -3  $ZOOKEEPER_HOME/conf/zoo.cfg 
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=0.0.0.0:2888:3888

#Start the Zookeeper (stop and start, or restart)
[zookeeper3]$ $ZOOKEEPER_HOME/bin/zkServer.sh  start
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower

[zookeeper3]$ $ZOOKEEPER_HOME/bin/zkServer.sh  status
ZooKeeper JMX enabled by default
ZooKeeper remote JMX Port set to 52128
ZooKeeper remote JMX authenticate set to false
ZooKeeper remote JMX ssl set to false
ZooKeeper remote JMX log4j set to true
Using config: /opt/zookeeper-3.4.8/bin/../conf/zoo.cfg
Mode: follower
Photomultiplier answered 26/6, 2016 at 13:27 Comment(0)

In my case, the issue was that I had to start all three zookeeper servers; only then was I able to connect to the zookeeper server using ./zkCli.sh.

Cupp answered 28/2, 2018 at 12:55 Comment(0)

Adding additional info regarding Zookeeper clustering inside Amazon's VPC. The solution with '0.0.0.0' works when Zookeeper runs directly on the EC2 instance; when you are using Docker, '0.0.0.0' will not work properly with Zookeeper 3.5.x after a node restart.

The issue lies in the resolving of '0.0.0.0', the ensemble's sharing of node addresses, and the SID order (if you start your nodes in descending order, this issue may not occur).

So far the only working solution is to upgrade to version 3.6.2+.

Dreher answered 1/3, 2021 at 13:52 Comment(0)

We faced the same issue; in our case the root cause of the problem was too many client connections. The default ulimit on an AWS EC2 instance is 1024, and this prevented the zookeeper nodes from communicating with each other.

The fix is to raise the ulimit to a higher number (e.g. ulimit -n 20000), then stop and start zookeeper.
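For reference, a sketch of checking and raising the open-file limit in the shell that launches zookeeper. The 20000 figure is just the value that worked for this answer; making it permanent requires an entry in /etc/security/limits.conf:

```shell
ulimit -n                 # show the current soft limit on open files
# Raising the soft limit above the hard limit requires root.
ulimit -n 20000 2>/dev/null || echo "raising beyond the hard limit requires root"
ulimit -n                 # confirm the limit now in effect for this shell
```

Note that ulimit only affects the current shell and its children, so it must be applied in the same session (or init script) that starts zookeeper.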

Pyrolysis answered 4/6, 2019 at 18:2 Comment(0)

I had a similar issue. The status on 2 of my 3 zookeeper nodes was listed as "standalone", even though the zoo.cfg file indicated that they should be clustered. My third node couldn't start, with the error you described. I think what fixed it for me was running zkServer.sh start in quick succession across my three nodes, so that zookeeper was running on all of them before the zoo.cfg initLimit was reached. Hope this works for someone out there.

Munro answered 1/7, 2019 at 19:54 Comment(0)

I had the same error log; in my case, I used the hostname of my node in zookeeper.conf.

My nodes were virtual machines running CentOS 8.

Like @user2286693 said, my mistake was the resolution mechanism:

From node1, when I ping node1:

PING node1(localhost (::1)) 56 data bytes

I checked my /etc/hosts file and found:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4 node1

I replaced this line with:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

and it's working!

Hope this helps someone!

Bowerbird answered 23/7, 2020 at 7:44 Comment(0)

When you are having this issue, you will see something like this:

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

This indicates that the cause is a network communication issue with Zookeeper.

How to fix it

Scale down zk to 0. Then scale back up to 3. Wait for them to all show ready.

Now go to zk-0 (oc rsh zk-0) and run this command:

/opt/fusion/bin/zookeeper-client
Connecting to zk-0.zk:9983,zk-1.zk:9983,zk-2.zk:9983

(--- paused for a moment here ---)

Welcome to ZooKeeper!
JLine support is enabled

[zk: zk-0.zk:9983,zk-1.zk:9983,zk-2.zk:9983(CONNECTING) 0] 

Notice how it still says "CONNECTING". This means you did not get a successful communication with zookeeper.

You will see this in the /opt/fusion/var/log/zookeeper/zookeeper.log when this happens:

2021-04-17T00:45:52,848 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@584] - Cannot open channel to 2 at election address zk-2.zk:3888
java.net.UnknownHostException: zk-2.zk
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) ~[?:1.8.0_262]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_262]
        at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_262]
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558) [zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:534) [zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:454) [zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:435) [zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

This is actually the infamous "No route to host exception" that we get on our OpenShift pods once in a while. When this is happening, zookeeper will show Ready but is not able to communicate with the other zookeepers, so it's actually not ready in a sense.

How to fix this then?

Scale zk statefulset to 0, then back up to 3 again.

And repeat until you get a successful connection:

/opt/fusion/bin/zookeeper-client
Connecting to zk-0.zk:9983,zk-1.zk:9983,zk-2.zk:9983
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: zk-0.zk:9983,zk-1.zk:9983,zk-2.zk:9983(CONNECTED) 0]

Notice the CONNECTED

Now you can restart the rest of your services that depend on zk.

Dannie answered 17/4, 2021 at 1:38 Comment(0)

Do the following:

  1. Make sure the servers can connect to each other
  2. Make sure ports 8080 and 2181 are free to use
  3. Make sure you have the correct myid file on the servers; other answers provide more details
  4. Close the firewall
  5. Specify 0.0.0.0 for the current node; other answers provide more details

If you are sure that all 5 items above are correct and you still get "Cannot open channel to x at election address", restart zookeeper: zkServer.sh restart. Then it works; really weird.

Alumna answered 4/2 at 23:36 Comment(0)

I got the same error because the quorum server port 3181 was still being used by another service; changing the port fixed it.

Trilley answered 3/10, 2021 at 13:3 Comment(3)
If the port is already in use you will not get a connection refused error but java.net.BindException: Port in use. – Provocative
Yes, you're right, but "cannot open channel to x at election address" will be thrown too. – Trilley
That is the error message, not the thrown error. It does not have a different message for every error thrown. The thrown error in this case is java.net.ConnectException: Connection refused. – Provocative

© 2022 - 2024 — McMap. All rights reserved.