I have a problem which I feel I have almost solved, but am stuck at the "last leg". I've been working on it for a few days now, and although have had breakthroughs, feel I am still a way off a workable solution - due to a few blind spots.
I have an embedded, mobile firmware application (which I can't edit directly), which is having trouble re-connecting to our server over TCP/IP/UMTS/GSM. The firmware uses AT commands only to initiate and control the connection, via a modem. The original version of our product has been working in production for around 10 years, however a "new" version of the hardware is experiencing intermittent problems.
There are many variables but I have ruled out all but one: the modem chip.
Of course I have endlessly read the Telit documentation (specifically "Easy IP", and the AT command, and software references) but cannot see much advertised difference between the two products. The HE910 is in a physical package by Glynn but I don't believe that affects its behaviour.
The problem:
Our firmware application connects to our server on a specific port/address. The server initiates an outbound application protocol to which the firmware must respond. (So really the firmware is the "server" but this doesn't matter.) It's very lightweight and simple, in the order of 10s of bytes per command/response.
The problem arises when the firmware re-connects, due to an event, "on top of" an existing socket connection. The firmware always follows the same process: teardown, configure, connect - regardless of whether there's an existing connection (this is due to the vagueness of "connected" on both TCP and 3G, it's best to either "try to send data", or re-connect to ensure a connection). As I said, I can't change this behaviour in the F/W.
The connect steps are:
(You might ask why the configuration steps 2,3,4 are done each connect: because they can be updated via our application protocol. But typically the settings are constant.)
The problem we're encountering is sometimes the devices connect but cannot communicate. The connections seem "stuck" - sometimes a few bytes get through, sometimes a whole command/response cycle, but often nothing. On connect about 10 commands normally are serviced, and if they fail the server drops the connection.
What I have found so far:
I have used a protocol analyser to establish what is going wrong: it's because the TCP FIN process of the AT#SH is overlapping with the TCP SYN handshake of the next connection initiated with the AT#SD. This seems to be due to delayed final FIN-ACK packets arriving late - after the next connection is being initiated.
In the trace below you can see Frame 8799 arrives late, as an ACK to 8788, but it comes out-of-sequence in the middle of the next SYN-ACK handshake!
After this happens the modem is basically "broken". All subsequent connections experience payload-packet loss, with lots of ReTransmits and both sides re-sending their payload packets as if they are not receiving the payload ACKs.
After step 1. (AT#SH) there is a 1 second pause in the F/W. We added this due to issues reconnecting when we first installed the HE910, (I know it is a kludge, but it was more to accomodate database-latency in our server state than for TCP/IP). However it seems this 1s delay isn't even affecting the TPC traffic, as you can see from the trace below. The ATSH "causes" Frame 8786 at 17:16:51.585, and the ATSD "causes" the SYN at Frame 8792 at 17:16:51.595, but there is only 10ms between them. I have studied the firmware and can confirm the 1s delay code looks right, and watching the firmware in Terminal - there IS a 1 second delay between the debug output of these steps - but TCP tells a different story.
I can send AT command in the firmware, in a limited manner - only one command every five seconds, which obviously skews manual tests to nice and slow. If I send +++, AT#SH during connection A, then trigger the connection event, the device will connect successfully. This is intriguing, as Step 1. of connecting is performing an AT#SH anyway. So I can conclude one of two things: either we need to send an AT#SH "earlier" (i.e. wait longer before connecting with AT#SD), or send two of them (less likely).
I understand that this is an area of common problems in applications. Even in the Wikipedia page on TCP/IP it mentions the risks of applications breaking the OSI model and falling foul of the TCP teardown handshake if used for application-level session termination. I'd love to add a nice "Goodbye" to our our application protocol, but that's not a viable option in the medium term... I'd like to get our two modems' TCP stacks aligned.
The server opens a new socket for the incoming connection, on the same IP/Port combination. The Node and the VB both create a new socket resource from a listening parent port, so I'm pretty sure at the server end nothing is being "reused" (it was a suspicion). Unfortunately I have limited vision into the modem's TCP stack, only using AT#SI and SO which give basic and seemingly non-realtime info. e.g. when the socket is closed at the server end, AT#SO still shows the IP/Port locally on the modem, like it hasn't received the FIN teardown.
From a fresh boot, the device can "reconnect" successfully around four times until the "jam" happens. This is strangely reliable to 4 +-1 connections. It can also be "reset" by waiting for some hours. So, (sorry) it feels like the modem just gets flustered. (Again, I did suspect our telco was "limiting" our connections, but cross-telco tests are just too similar. And consider in production we have thousands of these devices and just one of them can trigger this behaviour. It's just not statistically likely they could "notice" four connections within the hundreds of others?)
The final piece of behaviour is when the connection is "jammed" i.e not transmitting data, the server receives a RST (ECONNRESET in Node, just a Error in VB) after 30-60s. (Only if we disable our quicker application-level connection termination, so this went unnoticed for a long time). After this reset, however AT#SO still shows it as connected (with an IP), so again at first I felt it was some 3G/AP/gateway dropping the connection in the middle but not telling the client. However due to the consistency of the problem across Telcos and networks, I now find this hard to believe, and just think the AT responses aren't "up to date" with reality.
This typical TCP trace shows (my interpretation so far, which might not be right)
Trace
Frame TIME REL TIME SOURCE DEST SEQ ACK FLAGS WINDOW PAYLOAD BYTES
8003 17:15:55.9409634 703.7329764 CLIENT_IP SERVER_IP 0 0 (0x0) ......S. 65000 0 Flags=......S., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432024, Ack=0, Win=65000 ( ) = 65000 vb6.exe
8004 17:15:55.9410316 0.0000682 SERVER_IP CLIENT_IP 0 1 ...A..S. 8192 0 Flags=...A..S., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387462, Ack=316432025, Win=8192 ( Scale factor not supported ) = 8192 vb6.exe
8016 17:15:58.5707216 2.6296900 CLIENT_IP SERVER_IP 1 1 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432025, Ack=319387463, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8037 17:15:58.8632583 0.2925367 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=13, Seq=319387463 - 319387476, Ack=316432025, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8039 17:15:59.0105830 0.1473247 CLIENT_IP SERVER_IP 1 14 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432025, Ack=319387476, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8041 17:15:59.1006112 0.0900282 CLIENT_IP SERVER_IP 1 - 7 14 ...AP... 65000 6 Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=6, Seq=316432025 - 316432031, Ack=319387476, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8068 17:15:59.2997681 0.1991569 SERVER_IP CLIENT_IP 14 7 ...A.... 64994 0 Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387476, Ack=316432031, Win=64994 (scale factor 0x0) = 64994 vb6.exe
8071 17:15:59.3590861 0.0593180 SERVER_IP CLIENT_IP 14 - 19 7 ...AP... 64994 5 Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=5, Seq=319387476 - 319387481, Ack=316432031, Win=64994 (scale factor 0x0) = 64994 vb6.exe
8076 17:15:59.5106103 0.1515242 CLIENT_IP SERVER_IP 7 19 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432031, Ack=319387481, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8078 17:15:59.6005478 0.0899375 CLIENT_IP SERVER_IP 7 - 28 19 ...AP... 65000 21 Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=21, Seq=316432031 - 316432052, Ack=319387481, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8100 17:15:59.7989459 0.1983981 SERVER_IP CLIENT_IP 19 28 ...A.... 64973 0 Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387481, Ack=316432052, Win=64973 (scale factor 0x0) = 64973 vb6.exe
8110 17:15:59.9164691 0.1175232 SERVER_IP CLIENT_IP 19 - 37 28 ...AP... 64973 18 Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=18, Seq=319387481 - 319387499, Ack=316432052, Win=64973 (scale factor 0x0) = 64973 vb6.exe
8114 17:16:00.0005725 0.0841034 CLIENT_IP SERVER_IP 28 37 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432052, Ack=319387499, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8120 17:16:00.2005077 0.1999352 CLIENT_IP SERVER_IP 28 - 34 37 ...AP... 65000 6 Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=6, Seq=316432052 - 316432058, Ack=319387499, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8127 17:16:00.2509607 0.0504530 SERVER_IP CLIENT_IP 37 - 42 34 ...AP... 64967 5 Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=5, Seq=319387499 - 319387504, Ack=316432058, Win=64967 (scale factor 0x0) = 64967 vb6.exe
8131 17:16:00.5004721 0.2495114 CLIENT_IP SERVER_IP 34 42 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432058, Ack=319387504, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8132 17:16:00.5004721 0.0000000 CLIENT_IP SERVER_IP 34 - 43 42 ...AP... 65000 9 Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=9, Seq=316432058 - 316432067, Ack=319387504, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8160 17:16:00.6603237 0.1598516 SERVER_IP CLIENT_IP 42 - 49 43 ...AP... 64958 7 Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=7, Seq=319387504 - 319387511, Ack=316432067, Win=64958 (scale factor 0x0) = 64958 vb6.exe
8162 17:16:00.7505392 0.0902155 CLIENT_IP SERVER_IP 43 49 ...A.... 65000 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432067, Ack=319387511, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8168 17:16:00.9005226 0.1499834 CLIENT_IP SERVER_IP 43 - 55 49 ...AP... 65000 12 Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=12, Seq=316432067 - 316432079, Ack=319387511, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8197 17:16:01.1093181 0.2087955 SERVER_IP CLIENT_IP 49 55 ...A.... 64946 0 Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432079, Win=64946 (scale factor 0x0) = 64946 vb6.exe
8786 17:16:51.5853712 50.4760531 CLIENT_IP SERVER_IP 55 49 ...A...F 65000 0 Flags=...A...F, SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432079, Ack=319387511, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8787 17:16:51.5854172 0.0000460 SERVER_IP CLIENT_IP 49 56 ...A.... 64946 0 Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432080, Win=64946 (scale factor 0x0) = 64946 vb6.exe
8788 17:16:51.5860033 0.0005861 SERVER_IP CLIENT_IP 49 56 ...A...F 64946 0 Flags=...A...F, SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432080, Win=64946 (scale factor 0x0) = 64946 vb6.exe
8792 17:16:51.5954781 0.0094748 CLIENT_IP SERVER_IP 0 0 (0x0) ......S. 65000 0 Flags=......S., SrcPort=39522, DstPort=50008, PayloadLen=0, Seq=1817440308, Ack=0, Win=65000 ( ) = 65000 vb6.exe
8793 17:16:51.5955252 0.0000471 SERVER_IP CLIENT_IP 0 1 ...A..S. 8192 0 Flags=...A..S., SrcPort=50008, DstPort=39522, PayloadLen=0, Seq=4287314964, Ack=1817440309, Win=8192 ( Scale factor not supported ) = 8192 vb6.exe
8799 17:16:51.7152606 0.1197354 CLIENT_IP SERVER_IP 56 50 ...A.... 64999 0 Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432080, Ack=319387512, Win=64999 (scale factor 0x0) = 64999 vb6.exe
8801 17:16:51.7351969 0.0199363 CLIENT_IP SERVER_IP 1 1 ...A.... 65000 0 Flags=...A...., SrcPort=39522, DstPort=50008, PayloadLen=0, Seq=1817440309, Ack=4287314965, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8821 17:16:52.0333514 0.2981545 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8846 17:16:52.4579548 0.4246034 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8855 17:16:53.3003681 0.8424133 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8875 17:16:54.9852348 1.6848667 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8890 17:16:56.6729734 1.6877386 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8909 17:16:58.3734032 1.7004298 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8921 17:17:01.7429692 3.3695660 SERVER_IP CLIENT_IP 1 - 14 1 ...AP... 65000 13 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8978 17:17:08.4665817 6.7236125 SERVER_IP CLIENT_IP 1 - 27 1 ...AP... 65000 26 [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=26, Seq=4287314965 - 4287314991, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
9016 17:17:12.2738575 3.8072758 SERVER_IP CLIENT_IP 27 1 ...A...F 65000 0 Flags=...A...F, SrcPort=50008, DstPort=39522, PayloadLen=0, Seq=4287314991, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000 vb6.exe
Observations (with limited TCP knowledge)
So finally - my question.
How can I, purely with AT commands, reliably teardown a TCP/IP socket, and establish a new connection on a new socket to the same Port/IP with no chance of delayed teardown packets arriving out of sequence.
Or more specifically, how can I make the HE910 modem do this the same as the UC864 does quite successfully.
This problem actually turned out to be a bug in the Telit HE910 firmware v12.00.04.
Glyn said they were confident the upgraded would fix this problem, and the changelog had many bugfixes which looked related.
After upgrading to v12.00.06 all the problems went away.