I am using RServe 1.7.3 on a headless RHEL 7.9 VM. On the client, I am using RserveCLI2.
On long-running jobs, the TCP/IP connection gets blocked by a firewall after 2 hours.
I came across the keep.alive configuration option, which has been available since RServe 1.7.2 (RServe News/Changelog).
The specs read:
added support for keep.alive configuration option - it is global to all servers and if enabled the client sockets are instructed to keep the connection alive by periodic messages.
I added the following to /etc/Rserv.conf:
keep.alive enable
but this does not prevent the connection from being blocked.
Unfortunately, I cannot run a network monitoring tool, like Wireshark, to monitor the traffic between client and server.
How could I troubleshoot this?
Some specific questions I have:
/etc/Rserv.conf, as specified in Documentation for Rserve? Notice that it does not have a final e, like Rserve.
We got this to work.
To summarize, we adjusted some kernel settings so that keep-alive packets are sent at shorter intervals, which prevents network components from deeming the connection dead.
This is how and why.
The keep.alive enable setting is in fact an instruction to the socket layer to periodically emit keep-alive packets from server to client. The client is expected to return an ACK for these packets. The behaviour is governed by three kernel-level settings, as explained in TCP Keepalive HOWTO - Using TCP keepalive under Linux:
tcp_keepalive_time (defaults to 7200 seconds)
tcp_keepalive_intvl (defaults to 75 seconds)
tcp_keepalive_probes (defaults to 9 probes)
The tcp_keepalive_time is how long the connection must be idle before the first keep-alive packet is sent. The tcp_keepalive_intvl is the wait time between subsequent packets, and tcp_keepalive_probes is the number of consecutive unacknowledged packets that makes the system decide the connection is dead.
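To make the interplay of the three settings concrete, here is a small sketch of the arithmetic. The helper name keepalive_dead_after is made up for this illustration, and the formula assumes the peer never answers a single probe.

```shell
#!/bin/sh
# Sketch only: keepalive_dead_after is a hypothetical helper, not a system tool.
# Arguments: tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes
# Prints the number of idle seconds after which an unresponsive peer is
# declared dead: the initial idle time plus one interval per unanswered probe.
keepalive_dead_after() {
    echo $(( $1 + $2 * $3 ))
}

# Kernel defaults: first probe after 7200 s, then 9 probes 75 s apart.
keepalive_dead_after 7200 75 9    # prints 7875 (2 h 11 min 15 s)

# With time and interval both lowered to 600 s, probes left at 9.
keepalive_dead_after 600 600 9    # prints 6000
```

The values currently in effect can be read with sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes.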
So, the first keep-alive packet was only sent after 2 hours. By then, some network component had already decided the connection was dead, so the keep-alive packet never made it to the client and no ACK was ever sent.
We lowered both tcp_keepalive_time and tcp_keepalive_intvl to 600 seconds.
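For anyone wanting to reproduce this, below is a minimal sketch of one way to express the change as a sysctl fragment. The helper name keepalive_sysctl_conf is made up for this illustration; the script only prints the fragment and changes nothing on the system.

```shell
#!/bin/sh
# Sketch only: keepalive_sysctl_conf is a hypothetical helper name.
# It prints a sysctl fragment with the two keep-alive timers and does not
# modify any kernel setting by itself.
keepalive_sysctl_conf() {
    time_s="$1"     # idle seconds before the first keep-alive packet
    intvl_s="$2"    # seconds between subsequent keep-alive packets
    printf 'net.ipv4.tcp_keepalive_time = %s\n' "$time_s"
    printf 'net.ipv4.tcp_keepalive_intvl = %s\n' "$intvl_s"
}

# The values used in this answer: 600 seconds for both.
keepalive_sysctl_conf 600 600
```

As root, the output could be saved to a file under /etc/sysctl.d/ (for example 90-tcp-keepalive.conf, a name chosen here purely as an example) and loaded with sysctl --system. For a quick, non-persistent test, sysctl -w net.ipv4.tcp_keepalive_time=600 changes the running kernel only.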
With tcpdump -i [interface] port 6311 we were able to monitor the keep-alive packets:
15:40:11.225941 IP <server>.6311 > <some node>.<port>: Flags [.], ack 1576, win 237, length 0
15:40:11.226196 IP <some node>.<port> > <server>.6311: Flags [.], ack 401, win 511, length 0
This continues until the results are sent back and the connection is closed. At least, it did in a test that ran for 12 hours.
So, we use keep-alive here not to check for dead peers, but to prevent disconnection due to network inactivity, as is discussed in TCP Keepalive HOWTO - 2.2. Why use TCP keepalive?. In that scenario, you want to use low values for keep-alive time and interval.
Note that these are kernel-level settings and thus apply system-wide. We use a dedicated server, so this is not an issue for us, but it may be in other cases.
Finally, for completeness, I'll answer my own three questions.
/etc/Rserv.conf, as was confirmed by changing another setting (remote enable to remote disable).