ZeroMQ doesn't auto-reconnect

I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.

I start the server.
I start the client.
Each second the client sends a message to the server, and receives a response.
I press Ctrl-C to stop the server.
The client tries to send its next outgoing message and gets stuck in a never-returning epoll system call (as shown by strace).
I restart the server.
The zmq_recv call in the client is still stuck, even when the new server has been running for a minute. The only way to make progress for the client is to kill it (with Ctrl-C) and restart it.

Q1: Is this the expected behavior? I'd expect that in a few seconds the client should figure out that the server is running again, and it would auto-reconnect.

Q2: What should I change in the example code to fix this?

Q3: Am I using the wrong version of the software, or is something broken on my system?

I've disabled the firewall, sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.

In the strace -f ./hwclient output I can see that the client tries connect() 10 times a second (matching the default ZMQ_RECONNECT_IVL of 100 ms) after the server went down. In the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client (but it notices when I kill the client; the server also receives requests from other clients which were started after the server restart).

Using ipc:// instead of tcp:// causes the same behavior.

The auto-reconnect happens successfully in zmq_send if the server has been killed before the client does the next zmq_send. However, when the server gets killed while the client is running zmq_recv, then the zmq_recv blocks indefinitely, and the client can't seem to recover from that.

I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and it's already acting on it -- it just doesn't make zmq_recv resend the request to the new server -- or at least return early indicating an error.

Euthanasia answered 24/10, 2014 at 22:59 Comment(4)

Checking zmq_setsockopt and zmq_getsockopt may help; there are some options for reconnecting. – Gulf 25/10, 2014 at 0:52

@raison: It looks like the default value of ZMQ_RECONNECT_IVL in zmq_setsockopt (api.zeromq.org/4-0:zmq-setsockopt) already enables auto-reconnect. What else should I change? – Euthanasia 25/10, 2014 at 9:9

ZeroMQ recommends designing all code so that it can exit gracefully and release all resources. SIGKILL will not give much chance to .close() all ZMQ sockets and to .term() all process-related ZMQ context thread(s), which historically caused nasty memory leaks and O/S zombies blocking ports, and many troubles in production-grade environments, if not handled with care. – Gallop 25/10, 2014 at 9:22

FYI I've just noticed this very long chapter about reliable REQ-REP: zguide.zeromq.org/page:all#reliable-request-reply – Euthanasia 25/10, 2014 at 10:12

You may be having the same issue that zeromq just fixed for me in 4.0.6 (issue 1362). Basically, the subscriber socket wouldn't always resend its filter during a reconnection (an empty filter means no messages from the publisher reach that subscriber). The only way to recover was to restart the client application. Their fix seems to have done the job. The issue was really highlighted when using a transport (like stunnel) to tunnel the connections. Without 4.0.6, I was able to get around the issue by setting the "immediate" flag on the subscriber socket.

Neelon answered 24/2, 2015 at 0:26 Comment(0)

A3: No.

A2: Do not expect a demo to be designed for fault-resilient operation.

A1: Yes.

Where to go for more details?

IMHO the best next step is to get a more global view. That may sound complicated for the first few things one tries to code with ZeroMQ, but if you are not reading it step by step, at least jump to page 265 of Code Connected, Volume 1 [asPdf->].

The fastest learning curve would be to first take a look at Fig. 60 Republishing Updates and Fig. 62 HA Clone Server pair for a possible high-availability approach, and then go back to the roots, elements and details.

Gallop answered 25/10, 2014 at 5:53 Comment(0)

REQ / REP communication solution

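Set both ZMQ_REQ_CORRELATE and ZMQ_REQ_RELAXED to 1. With ZMQ_REQ_RELAXED the REQ socket may send a new request before the reply to the previous one has arrived, and ZMQ_REQ_CORRELATE lets it discard replies that do not match the latest request. Use ZeroMQ version 4.2 or higher for these settings.

Here is the page by the author of the solution: improving-req-sockets-in-zqm-4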

Read more in the manual: http://api.zeromq.org/4-2:zmq-setsockopt

Arleen answered 19/11, 2020 at 18:58 Comment(0)
