multithreading, asp.net-core, .net-core, deadlock

Diagnose a frozen ASP.NET Core process


I'm diagnosing a frozen process that runs a .NET service in a Docker container. The process becomes unresponsive, seemingly at random, after running for several hours. I have collected memory dumps from several instances of the freeze using the dotnet-dump collect tool.

After analyzing these dumps, I see no clear pattern that points to the root cause. Everything looks fine: there is no OOM, no infinite loop, and no deadlock or livelock that I can see. At most one worker thread is running my code (i.e. not code from other libraries or .NET itself), and it does not appear to have any lock-related issue.

Here is the digest of one of these dumps:

[1] The main thread: awaiting tasks;
[12716] TP thread: running my code (capturing an image from a camera);
5 TP threads: waiting for work to do;
[36, 37, 38] MongoDB threads: doing MongoDB-related things; I'm pretty sure there is no database activity at the moment;
[28] Serilog thread: writing logs;
[12720] Processing the subscription of an observable;

And various other threads which I considered unlikely to be related to this problem.

Here is an exported image from Visual Studio's Parallel Stacks:

[Image: Parallel Stacks]

I took another dump an hour later and saw that nothing had made any progress; the stacks were unchanged.

To me the issue is curious because:

Any thoughts are appreciated!


Solution

  • In our case this is most likely caused by a bug in the garbage collector. When the service falls into the frozen state, all the threads are deadlocked by the GC.

    This is likely caused by one or more bugs in the .NET 9 runtime. There have been various GC hang reports since the release of .NET 9 (1, 2, etc.), and the dotnet/runtime team has made several fixes in response. Our service was targeting an early version of the .NET 9 runtime (9.0.102). After retargeting to the latest version (9.0.300), the issue did not reappear in our recent load test, which ran for 20+ hours.
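
    Since the fix depends on which runtime the container actually executes on, it can help to log that at startup. The version numbers quoted above look like SDK-style numbers, and the runtime inside the image may be numbered differently. The following is a minimal sketch, not the code from our service: `RuntimeReport` and `Describe` are hypothetical names introduced for illustration, while `RuntimeInformation.FrameworkDescription`, `Environment.Version` and `GCSettings` are standard .NET APIs.

```csharp
using System;
using System.Runtime;
using System.Runtime.InteropServices;

// Hypothetical helper (not from the original service): builds a one-line
// summary of the runtime and GC configuration the process is running with.
public static class RuntimeReport
{
    public static string Describe()
    {
        return $"Runtime: {RuntimeInformation.FrameworkDescription}; " +
               $"Environment.Version: {Environment.Version}; " +
               $"Server GC: {GCSettings.IsServerGC}; " +
               $"GC latency mode: {GCSettings.LatencyMode}";
    }
}

public static class Program
{
    public static void Main()
    {
        // In a real service this line would go to the logger (e.g. Serilog)
        // so the runtime version is recorded with every deployment.
        Console.WriteLine(RuntimeReport.Describe());
    }
}
```

    Writing this line to the service log on every start makes it easy to tell, when a freeze is reported, whether the affected instance was really running the patched runtime.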