socketsnetwork-programmingtcpat-command3g

How to reliably teardown and reconnect a TCP/IP socket with AT Commands


I have a problem which I feel I have almost solved, but am stuck at the "last leg". I've been working on it for a few days now, and although have had breakthroughs, feel I am still a way off a workable solution - due to a few blind spots.

I have an embedded, mobile firmware application (which I can't edit directly), which is having trouble re-connecting to our server over TCP/IP/UMTS/GSM. The firmware uses AT commands only to initiate and control the connection, via a modem. The original version of our product has been working in production for around 10 years, however a "new" version of the hardware is experiencing intermittent problems.

There are many variables but I have ruled out all but one: the modem chip.

Of course I have endlessly read the Telit documentation (specifically "Easy IP", and the AT command, and software references) but cannot see much advertised difference between the two products. The HE910 is in a physical package by Glynn but I don't believe that affects its behaviour.

The problem:

Our firmware application connects to our server on a specific port/address. The server initiates an outbound application protocol to which the firmware must respond. (So really the firmware is the "server" but this doesn't matter.) It's very lightweight and simple, in the order of 10s of bytes per command/response.

The problem arises when the firmware re-connects, due to an event, "on top of" an existing socket connection. The firmware always follows the same process: teardown, configure, connect - regardless of whether there's an existing connection (this is due to the vagueness of "connected" on both TCP and 3G, it's best to either "try to send data", or re-connect to ensure a connection). As I said, I can't change this behaviour in the F/W.

The connect steps are:

  1. a) Socket Teardown AT#SH b) Wait 1s
  2. Configure the PDP context AT+CGDCONT
  3. Configure TCP/IP stack : AT#SCFG
  4. Activate the PDP context: AT#SGACT
  5. Connect TCP/IP to the server: AT#SD

(You might ask why the configuration steps 2,3,4 are done each connect: because they can be updated via our application protocol. But typically the settings are constant.)

The problem we're encountering is sometimes the devices connect but cannot communicate. The connections seem "stuck" - sometimes a few bytes get through, sometimes a whole command/response cycle, but often nothing. On connect about 10 commands normally are serviced, and if they fail the server drops the connection.

What I have found so far:

I have used a protocol analyser to establish what is going wrong: it's because the TCP FIN process of the AT#SH is overlapping with the TCP SYN handshake of the next connection initiated with the AT#SD. This seems to be due to delayed final FIN-ACK packets arriving late - after the next connection is being initiated.

In the trace below you can see Frame 8799 arrives late, as an ACK to 8788, but it comes out-of-sequence in the middle of the next SYN-ACK handshake!

After this happens the modem is basically "broken". All subsequent connections experience payload-packet loss, with lots of ReTransmits and both sides re-sending their payload packets as if they are not receiving the payload ACKs.

After step 1. (AT#SH) there is a 1 second pause in the F/W. We added this due to issues reconnecting when we first installed the HE910, (I know it is a kludge, but it was more to accomodate database-latency in our server state than for TCP/IP). However it seems this 1s delay isn't even affecting the TPC traffic, as you can see from the trace below. The ATSH "causes" Frame 8786 at 17:16:51.585, and the ATSD "causes" the SYN at Frame 8792 at 17:16:51.595, but there is only 10ms between them. I have studied the firmware and can confirm the 1s delay code looks right, and watching the firmware in Terminal - there IS a 1 second delay between the debug output of these steps - but TCP tells a different story.

I can send AT command in the firmware, in a limited manner - only one command every five seconds, which obviously skews manual tests to nice and slow. If I send +++, AT#SH during connection A, then trigger the connection event, the device will connect successfully. This is intriguing, as Step 1. of connecting is performing an AT#SH anyway. So I can conclude one of two things: either we need to send an AT#SH "earlier" (i.e. wait longer before connecting with AT#SD), or send two of them (less likely).

I understand that this is an area of common problems in applications. Even in the Wikipedia page on TCP/IP it mentions the risks of applications breaking the OSI model and falling foul of the TCP teardown handshake if used for application-level session termination. I'd love to add a nice "Goodbye" to our our application protocol, but that's not a viable option in the medium term... I'd like to get our two modems' TCP stacks aligned.

The server opens a new socket for the incoming connection, on the same IP/Port combination. The Node and the VB both create a new socket resource from a listening parent port, so I'm pretty sure at the server end nothing is being "reused" (it was a suspicion). Unfortunately I have limited vision into the modem's TCP stack, only using AT#SI and SO which give basic and seemingly non-realtime info. e.g. when the socket is closed at the server end, AT#SO still shows the IP/Port locally on the modem, like it hasn't received the FIN teardown.

From a fresh boot, the device can "reconnect" successfully around four times until the "jam" happens. This is strangely reliable to 4 +-1 connections. It can also be "reset" by waiting for some hours. So, (sorry) it feels like the modem just gets flustered. (Again, I did suspect our telco was "limiting" our connections, but cross-telco tests are just too similar. And consider in production we have thousands of these devices and just one of them can trigger this behaviour. It's just not statistically likely they could "notice" four connections within the hundreds of others?)

The final piece of behaviour is when the connection is "jammed" i.e not transmitting data, the server receives a RST (ECONNRESET in Node, just a Error in VB) after 30-60s. (Only if we disable our quicker application-level connection termination, so this went unnoticed for a long time). After this reset, however AT#SO still shows it as connected (with an IP), so again at first I felt it was some 3G/AP/gateway dropping the connection in the middle but not telling the client. However due to the consistency of the problem across Telcos and networks, I now find this hard to believe, and just think the AT responses aren't "up to date" with reality.

This typical TCP trace shows (my interpretation so far, which might not be right)

Trace

Frame   TIME                REL TIME    SOURCE          DEST    SEQ ACK     FLAGS       WINDOW  PAYLOAD BYTES       

8003    17:15:55.9409634    703.7329764 CLIENT_IP   SERVER_IP   0   0 (0x0) ......S.    65000   0   Flags=......S., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432024, Ack=0, Win=65000 (  ) = 65000    vb6.exe
8004    17:15:55.9410316    0.0000682   SERVER_IP   CLIENT_IP   0   1       ...A..S.    8192    0   Flags=...A..S., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387462, Ack=316432025, Win=8192 ( Scale factor not supported ) = 8192    vb6.exe
8016    17:15:58.5707216    2.6296900   CLIENT_IP   SERVER_IP   1   1       ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432025, Ack=319387463, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8037    17:15:58.8632583    0.2925367   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=13, Seq=319387463 - 319387476, Ack=316432025, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8039    17:15:59.0105830    0.1473247   CLIENT_IP   SERVER_IP   1   14      ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432025, Ack=319387476, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8041    17:15:59.1006112    0.0900282   CLIENT_IP   SERVER_IP   1 - 7   14  ...AP...    65000   6   Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=6, Seq=316432025 - 316432031, Ack=319387476, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8068    17:15:59.2997681    0.1991569   SERVER_IP   CLIENT_IP   14  7       ...A....    64994   0   Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387476, Ack=316432031, Win=64994 (scale factor 0x0) = 64994  vb6.exe
8071    17:15:59.3590861    0.0593180   SERVER_IP   CLIENT_IP   14 - 19 7   ...AP...    64994   5   Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=5, Seq=319387476 - 319387481, Ack=316432031, Win=64994 (scale factor 0x0) = 64994  vb6.exe
8076    17:15:59.5106103    0.1515242   CLIENT_IP   SERVER_IP   7   19      ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432031, Ack=319387481, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8078    17:15:59.6005478    0.0899375   CLIENT_IP   SERVER_IP   7 - 28  19  ...AP...    65000   21  Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=21, Seq=316432031 - 316432052, Ack=319387481, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8100    17:15:59.7989459    0.1983981   SERVER_IP   CLIENT_IP   19  28      ...A....    64973   0   Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387481, Ack=316432052, Win=64973 (scale factor 0x0) = 64973  vb6.exe
8110    17:15:59.9164691    0.1175232   SERVER_IP   CLIENT_IP   19 - 37 28  ...AP...    64973   18  Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=18, Seq=319387481 - 319387499, Ack=316432052, Win=64973 (scale factor 0x0) = 64973 vb6.exe
8114    17:16:00.0005725    0.0841034   CLIENT_IP   SERVER_IP   28  37      ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432052, Ack=319387499, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8120    17:16:00.2005077    0.1999352   CLIENT_IP   SERVER_IP   28 - 34 37  ...AP...    65000   6   Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=6, Seq=316432052 - 316432058, Ack=319387499, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8127    17:16:00.2509607    0.0504530   SERVER_IP   CLIENT_IP   37 - 42 34  ...AP...    64967   5   Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=5, Seq=319387499 - 319387504, Ack=316432058, Win=64967 (scale factor 0x0) = 64967  vb6.exe
8131    17:16:00.5004721    0.2495114   CLIENT_IP   SERVER_IP   34  42      ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432058, Ack=319387504, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8132    17:16:00.5004721    0.0000000   CLIENT_IP   SERVER_IP   34 - 43 42  ...AP...    65000   9   Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=9, Seq=316432058 - 316432067, Ack=319387504, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8160    17:16:00.6603237    0.1598516   SERVER_IP   CLIENT_IP   42 - 49 43  ...AP...    64958   7   Flags=...AP..., SrcPort=50008, DstPort=17790, PayloadLen=7, Seq=319387504 - 319387511, Ack=316432067, Win=64958 (scale factor 0x0) = 64958  vb6.exe
8162    17:16:00.7505392    0.0902155   CLIENT_IP   SERVER_IP   43  49      ...A....    65000   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432067, Ack=319387511, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8168    17:16:00.9005226    0.1499834   CLIENT_IP   SERVER_IP   43 - 55 49  ...AP...    65000   12  Flags=...AP..., SrcPort=17790, DstPort=50008, PayloadLen=12, Seq=316432067 - 316432079, Ack=319387511, Win=65000 (scale factor 0x0) = 65000 vb6.exe
8197    17:16:01.1093181    0.2087955   SERVER_IP   CLIENT_IP   49  55      ...A....    64946   0   Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432079, Win=64946 (scale factor 0x0) = 64946  vb6.exe
8786    17:16:51.5853712    50.4760531  CLIENT_IP   SERVER_IP   55  49      ...A...F    65000   0   Flags=...A...F, SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432079, Ack=319387511, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8787    17:16:51.5854172    0.0000460   SERVER_IP   CLIENT_IP   49  56      ...A....    64946   0   Flags=...A...., SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432080, Win=64946 (scale factor 0x0) = 64946  vb6.exe
8788    17:16:51.5860033    0.0005861   SERVER_IP   CLIENT_IP   49  56      ...A...F    64946   0   Flags=...A...F, SrcPort=50008, DstPort=17790, PayloadLen=0, Seq=319387511, Ack=316432080, Win=64946 (scale factor 0x0) = 64946  vb6.exe
8792    17:16:51.5954781    0.0094748   CLIENT_IP   SERVER_IP   0   0 (0x0) ......S.    65000   0   Flags=......S., SrcPort=39522, DstPort=50008, PayloadLen=0, Seq=1817440308, Ack=0, Win=65000 (  ) = 65000   vb6.exe
8793    17:16:51.5955252    0.0000471   SERVER_IP   CLIENT_IP   0   1       ...A..S.    8192    0   Flags=...A..S., SrcPort=50008, DstPort=39522, PayloadLen=0, Seq=4287314964, Ack=1817440309, Win=8192 ( Scale factor not supported ) = 8192  vb6.exe
8799    17:16:51.7152606    0.1197354   CLIENT_IP   SERVER_IP   56  50      ...A....    64999   0   Flags=...A...., SrcPort=17790, DstPort=50008, PayloadLen=0, Seq=316432080, Ack=319387512, Win=64999 (scale factor 0x0) = 64999  vb6.exe
8801    17:16:51.7351969    0.0199363   CLIENT_IP   SERVER_IP   1   1       ...A....    65000   0   Flags=...A...., SrcPort=39522, DstPort=50008, PayloadLen=0, Seq=1817440309, Ack=4287314965, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8821    17:16:52.0333514    0.2981545   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000  vb6.exe
8846    17:16:52.4579548    0.4246034   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8855    17:16:53.3003681    0.8424133   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8875    17:16:54.9852348    1.6848667   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8890    17:16:56.6729734    1.6877386   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8909    17:16:58.3734032    1.7004298   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8921    17:17:01.7429692    3.3695660   SERVER_IP   CLIENT_IP   1 - 14  1   ...AP...    65000   13  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=13, Seq=4287314965 - 4287314978, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
8978    17:17:08.4665817    6.7236125   SERVER_IP   CLIENT_IP   1 - 27  1   ...AP...    65000   26  [ReTransmit #8821]Flags=...AP..., SrcPort=50008, DstPort=39522, PayloadLen=26, Seq=4287314965 - 4287314991, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe
9016    17:17:12.2738575    3.8072758   SERVER_IP   CLIENT_IP   27  1       ...A...F    65000   0   Flags=...A...F, SrcPort=50008, DstPort=39522, PayloadLen=0, Seq=4287314991, Ack=1817440309, Win=65000 (scale factor 0x0) = 65000    vb6.exe

Observations (with limited TCP knowledge)

So finally - my question.

How can I, purely with AT commands, reliably teardown a TCP/IP socket, and establish a new connection on a new socket to the same Port/IP with no chance of delayed teardown packets arriving out of sequence.

Or more specifically, how can I make the HE910 modem do this the same as the UC864 does quite successfully.


Solution

  • This problem actually turned out to be a bug in the Telit HE910 firmware v12.00.04.

    Glyn said they were confident the upgraded would fix this problem, and the changelog had many bugfixes which looked related.

    After upgrading to v12.00.06 all the problems went away.