I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.
When I kill the server and restart it while the client is waiting for a reply, the zmq_recv call in the client is still stuck, even when the new server has been running for a minute. The only way for the client to make progress is to kill it (with Ctrl-C) and restart it.
Q1: Is this the expected behavior? I'd expect the client to figure out within a few seconds that the server is running again and to auto-reconnect.
Q2: What should I change in the example code to fix this?
Q3: Am I using the wrong version of the software, or is something broken on my system?
I've disabled the firewall: sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.
In the strace -f ./hwclient output I can see that the client tries to connect() 10 times a second (the default value of ZMQ_RECONNECT_IVL is 100 ms) after the server goes down. In the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client. The server does notice when I kill the client, and it does receive requests from other clients that were started after the server restart.
Using ipc:// instead of tcp:// causes the same behavior.
The auto-reconnect happens successfully in zmq_send if the server has been killed before the client does the next zmq_send. However, when the server gets killed while the client is blocked in zmq_recv, the zmq_recv blocks indefinitely, and the client can't seem to recover from that.
I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and the client is already acting on it. It just doesn't make zmq_recv resend the request to the new server, or at least return early with an error.
You may be having the same issue that zeromq just fixed for me in 4.0.6 (issue 1362). Basically, the subscriber socket wouldn't always resend its filter during a reconnection (an empty filter means no messages go from the publisher to that subscriber). The only way to recover was to restart the client application. Their fix seems to have done the job. The issue was really highlighted when using a transport (like stunnel) to tunnel the connections. Without 4.0.6, I was able to get around the issue by setting the "immediate" flag on the subscriber socket.