Most of time, my rsh cycle is general OK, we could get following logs from rshd:
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'
While for some error case, the rsh could success but there are several seconds delay, see the following timestamp:
Aug 19 04:12:24 shmm500 authpriv.info in.rshd[17968]: connect from 172.17.0.40 (172.17.0.40)
Aug 19 04:12:27 shmm500 auth.info rshd[17972]: root@172.17.0.40 as root: cmd='echo 18'
I also found that, for most normal case, the PID increased by 1, while for most error case, PID increasd by 4, see the PID in above logs, seems rshd forks some processes. So would you provide any explanation for why rshd took these several seconds and PID increase.
Our rsh is the old rsh, not ssh, I'm not sure, but seems the rsh is from netkit. And this is an embedded board with busybox, no strace/pstack. For client side, I just 'rsh 172.17.0.8 pwd', not hostname is used.
Answer the question by myself:
This issue was caused by a frame loss. Either SYN or SYN+ACK in 3-way handshake was dropped at a rare rate for some reason, anyway the client peer didn't get the SYN+ACK within in 3 seconds timeout(this timeout is hardcoded in Linux kernel), then the connect() resent SYN again, and usually successful at the second try.
From the viewpoint of application, we got 3 seconds delay, or even 6 seconds if it failed at the second try.
Other relevant information:
The first log is from tcpd(aka tcp wrapper)
Aug 19 04:36:34 shmm500 authpriv.info in.rshd[21343]: connect from 172.17.0.40 (172.17.0.40)
The second log is from rshd in netkit 0.17
Aug 19 04:36:34 shmm500 auth.info rshd[21344]: root@172.17.0.40 as root: cmd='echo 481'
rsh need two tcp connections, the first is from rsh client to rshd, and the second tcp connection is from rshd to rsh client, which means the rshd is the tcp client. And my issue is frame loss on the second tcp connection.