zookeeper server not running
Asked Answered
Q

4

8

I'm trying to start an hbase master from ambari.

It can't start it because it can't connect to zookeper server.

Ambari marks all the zookeper servers (3 nodes) as running.

The application server (tomcat¿?) that runs the zookeper server application seems to be running fine; At least there is a service listening on the specified port.

But the application is not able to connect to the other nodes and it seems like it doesn't start.

All the connections are closed with the error message ZooKeeperServer not running on zookeeper server log, and zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket on the client.

This is the zookeper server log output for those nodes (same log for all of them, only the node names change):

2016-03-31 16:15:34,550 - INFO  [main:QuorumPeerConfig@103] - Reading configuration from: /usr/hdp/current/zookeeper-server/conf/zoo.cfg
2016-03-31 16:15:34,553 - INFO  [main:QuorumPeerConfig@338] - Defaulting to majority quorums
2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 30
2016-03-31 16:15:34,557 - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 24
2016-03-31 16:15:34,558 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2016-03-31 16:15:34,565 - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2016-03-31 16:15:34,566 - INFO  [main:QuorumPeerMain@127] - Starting quorum peer
2016-03-31 16:15:34,573 - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@992] - tickTime set to 2000
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1012] - minSessionTimeout set to -1
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1023] - maxSessionTimeout set to -1
2016-03-31 16:15:34,582 - INFO  [main:QuorumPeer@1038] - initLimit set to 10
2016-03-31 16:15:34,598 - INFO  [Thread-2:QuorumCnxManager$Listener@506] - My election bind port: sg1.imatiasl.lan/127.0.0.1:3888
2016-03-31 16:15:34,607 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@747] - LOOKING
2016-03-31 16:15:34,608 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@815] - New election. My id =  1, proposed zxid=0x0
2016-03-31 16:15:34,609 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (
n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEpoch) LOOKING (my state)
2016-03-31 16:15:34,612 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.lan/10.7.0.93:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:745)
2016-03-31 16:15:34,614 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.lan/10.7.0.94:3888
java.net.ConnectException: Conexión rehusada
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:745)
2016-03-31 16:15:34,812 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 2 at election address sg2.imatiasl.la
n/10.7.0.93:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
2016-03-31 16:15:34,813 - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.la
n/10.7.0.94:3888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:404)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:795)
2016-03-31 16:15:34,813 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@849] - Notification time out: 400

When the client tries to connect:

2016-03-31 16:15:35,086 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.7.0.93:55914
2016-03-31 16:15:35,130 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOExcep
tion: ZooKeeperServer not running
2016-03-31 16:15:35,130 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.7.0.93:55914 (no ses
sion established for client)

And so on...

Any ideas on how to fix this?

Quaternion answered 1/4, 2016 at 16:44 Comment(0)
C
5

Your election port is being binded at sgX.imatiasl.lan/127.0.0.1:3888 for all nodes, so when the clients try to connect to sgY.imatiasl.lan/10.7.0.93:3888 it fails.

The election ports should bind to 0.0.0.0:3888 or the real IP of each node, but for some reason they are being resolved to 127.0.0.1. You can check the IP:port in each node with netstat -patun to confirm this.

Much probably you have some issue with /etc/hosts. Take a look at: https://unix.stackexchange.com/questions/240506/zookeeper-dns-name-problems-with-leader-elections-when-migrating-from-windows-to

Coffle answered 4/4, 2016 at 9:30 Comment(5)
Thank you! I was binding sgX to 127.0.0.1 in /etc/hosts of each sgX (which incidentally I did to solve some problem during the cluster setup I don't even remember), and that was the problem.Quaternion
Now I can start zookeeper and hbase RegionServer without problems, but HBase master is stil resisting. I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 58: ordinal not in range(128) on checked_call['curl -sS -L -w '%{http_code}' -X GET 'http://sg1.imatiasl.lan:50070/webhdfs/v1/apps/hbase/data?op=GETFILESTATUS&user.name=hdfs''] {'logoutput': None, 'user': 'hdfs', 'stderr': -1, 'quiet': False}. Do you know how I can solve this?Quaternion
(posted another question here: #36409605 )Quaternion
I don't use Ambari and I don't know python, but that seems to be related to 'ñ' and tildes. Try removing them wherever they are, or use unicode u'..' strings. I can't deduce much more :( salvatorelab.es/2013/12/…Coffle
as NotGael said, be aware that if you have bind some hostname to ip like 127.0.1.1 in /etc/hosts, even with multiple servers, Zookeeper will fail establishing server 2 server connexionDarceldarcey
J
1

I faced a similar issue in a Kubernetes environment. Although I couldn't find a proper explanation for why it occurred, restarting all the ZooKeeper instances at once resolved the problem for me.

If someone has better insights, I would be glad to hear them.

Jarodjarosite answered 8/3, 2023 at 19:15 Comment(0)
J
0

2016-03-31 16:15:34,813 - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383] - Cannot open channel to 3 at election address sg3.imatiasl.la One of your nodes is refusing connections.

Junette answered 2/4, 2016 at 22:44 Comment(1)
all of my nodes are refusing connections. This one cannot connect to the other two, and those two cannot connect to this one or each other. It happens when I try to start zookeeper. Ambari says it's starting right but, as you can see on the logs, it's not. Do you know why?Quaternion
S
0

Try to execute "jps" command on your nodes and see if zookeeper service is up, if not start it.

Singlephase answered 3/4, 2016 at 11:46 Comment(2)
I ran jps -l as the user zookeeper and got 612 org.apache.zookeeper.server.quorum.QuorumPeerMain, so I guess it is. What can I do now?Quaternion
can you reboot? or kill the processes? I think you wished to run just one service of zookeeper, do you run any script to init ZK that maybe has some bad loop inside?Singlephase

© 2022 - 2024 — McMap. All rights reserved.