vb.net service freeze debugdiag rights-management

Windows services freeze irregularly


Because I am running out of arguments in discussions with our admins, I hope you can help me with the following issue.

We see strange behaviour in our self-implemented Windows services: they freeze at random. Sometimes they keep working for weeks, and sometimes they freeze several times in one week. I am fairly sure the cause is not bad code or unhandled exceptions. In my opinion this is some kind of Windows administration or rights-management problem combined with a timing coincidence.

But let's start with some information first:

Because I could not see any logged errors, I installed DebugDiag on the affected server, added crash rules for the services in question, and may have found something interesting. Here is an extract of the DebugDiag log:

[12.06.2017 01:04:05]
  Thread created. New thread - System ID: 17372
[12.06.2017 01:04:29]
  Thread exited. Exiting thread - System ID: 7152. Exit code - 0x00000000
[12.06.2017 06:55:25]
  Thread created. New thread - System ID: 13252
  Thread exited. Exiting thread - System ID: 31012. Exit code - 0x00000000
  C:\Windows\System32\wship6.dll Unloaded from 0xfcee0000
  C:\Windows\System32\wshtcpip.dll Unloaded from 0xfc650000
  C:\Windows\System32\fwpuclnt.dll Unloaded from 0xfb1c0000
  C:\Windows\system32\security.dll Unloaded from 0x6f9e0000
  Thread exited. Exiting thread - System ID: 25912. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 17372. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 27412. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 13252. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 31768. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 27540. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 12252. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 29336. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 5620. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 8248. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 4340. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 18056. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 34164. Exit code - 0x00000000
  Process exited. Exit code - 0x00000000

The last sign of life from the service that froze this time (let's say it was service A, variant 2) was at 01:04:29, when one thread exited. At 06:55:25 one of our admins restarted the service because he saw that it seemed to be frozen. DebugDiag wrote no dump, so I again assume that the service did not crash.

What struck me as strange was that wship6.dll, wshtcpip.dll, fwpuclnt.dll and security.dll were unloaded while the service was restarting, because I had not seen this before. I then restarted another variant of service A, one that was not frozen, several times. I saw the same entries, but only on the first restart. On later stops and starts I could not see the libraries being unloaded.

So after a lot of information:

Edit 16.06.2017: Last night a different Windows service stopped working with the same behaviour. Some variants of that service are frozen and some are still working. But this time the log does not show the mentioned DLLs being unloaded when the service was restarted, so the first suspicion about the unloaded DLLs may not help with further diagnostics. One interesting fact: this service stopped working at the same time as the first service. Maybe there is a problem with the VM backups or something similar? I guess a regularly scheduled task is causing the problem. Do you have any hints?

Edit 19.06.2017: I think we have found something interesting. The freezing services all have one .NET component in common: a FileSystemWatcher. This has never been a problem in the past, because we extended the .NET FileSystemWatcher with a self-reconnecting feature.

The file server that contains the path watched by our FileSystemWatcher is backed up every night. Our reconnect feature checks once per second whether this network path is unavailable. If it is, the FileSystemWatcher is reconnected once the path is available again. The host server, which manages all our virtual servers, was upgraded a few days ago.

So we have the following suspicion: Let's assume our Windows service checks the network path at times t_1000 and t_2000. The virtual server backup disconnects the virtual file server, which contains the network path monitored by the FileSystemWatcher, at time t_1200 and reconnects it at t_1500. In this case our reconnect feature cannot work properly, because the network path was available at both t_1000 and t_2000. The FileSystemWatcher has nevertheless lost its connection and no longer reacts to incoming files in that network path.

This was not a problem before, because the reconnect triggered by our backup software took some milliseconds longer on the slower hardware of the old host server, so our reconnect feature worked fine.
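The timing argument above can be made concrete with a small simulation. This is only a sketch in Python, not the service's VB.NET code; the function name `poll_detects_outage` and the second, longer outage window are illustrative assumptions and are not taken from the real system.

```python
def poll_detects_outage(poll_times, outage_start, outage_end):
    """Return True if at least one availability check falls inside the outage window.

    All times are in milliseconds; the outage covers outage_start <= t < outage_end.
    """
    return any(outage_start <= t < outage_end for t in poll_times)


# One availability check per second, as in the t_1000 / t_2000 example.
poll_times = range(0, 5001, 1000)

# Outage from t_1200 to t_1500: it lies entirely between two checks,
# so the path looks available every time it is checked.
print(poll_detects_outage(poll_times, 1200, 1500))  # False

# Hypothetical longer outage from t_1200 to t_2100: the check at t_2000
# falls inside it, so the reconnect logic would be triggered.
print(poll_detects_outage(poll_times, 1200, 2100))  # True
```

Under this model, an outage at least as long as the polling interval is always seen by some check, while a shorter one is only seen if a check happens to fall inside it.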

So what can we do?

Many thanks in advance.


Solution

  • Here is our solution for anyone who is interested:

    The vendor of the backup software is aware of this problem but is not willing to fix it. So we decided to create a new virtual machine to serve as a file server for our needs. This new file server will not be backed up via snapshot.

    I did not find a way to further improve our FileSystemWatcher, so I guess this is our only option for solving the problem.
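For readers who cannot move the watched share, a commonly used complement to change notifications is a periodic directory scan that picks up files the watcher missed. The following is a minimal sketch in Python under stated assumptions, not something the setup above is known to use: `find_unseen_files` is a hypothetical helper, files are identified by name only, and the temporary directory merely stands in for the network path.

```python
import os
import tempfile


def find_unseen_files(directory, seen):
    """Return the sorted names of regular files in `directory` that are not in `seen`.

    Newly found names are added to `seen`, so each file is reported only once.
    """
    new_files = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name not in seen:
                seen.add(entry.name)
                new_files.append(entry.name)
    return sorted(new_files)


# Demonstration with a temporary directory standing in for the watched path.
with tempfile.TemporaryDirectory() as watched:
    seen = set()
    open(os.path.join(watched, "a.txt"), "w").close()
    print(find_unseen_files(watched, seen))  # ['a.txt']
    open(os.path.join(watched, "b.txt"), "w").close()
    print(find_unseen_files(watched, seen))  # ['b.txt']
    print(find_unseen_files(watched, seen))  # []
```

Such a scan does not depend on the watcher noticing a disconnect, but it adds load on the share and some latency, so whether it would suit these services is an open question.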