async-awaittimeouttasktask-parallel-libraryflakyness

Task.WhenAny with Task.Delay to implement a timeout mechanism does not work reliable


From within my ASP.NET application (.NET Framework 4.8) I invoke an external process via Anonymous Pipes. Theoretically, it could happen that the external process hangs (for whatever reason) when my application writes to or reads from the stream. Hence I try to implement a timeout mechanism so that my application does not hang as well:

public static class TaskExtensions
{
    public static async Task<T> WithTimeout<T>([NotNull] this Task<T> task, int millisecondsTimeout, [CanBeNull] Func<T> onTimeout = null)
    {
        using (var timeoutCancellation = new CancellationTokenSource())
        {
            var delayTask = Task.Delay(millisecondsTimeout, timeoutCancellation.Token);
            var completedTask = await Task.WhenAny(task, delayTask).ConfigureAwait(false);

            if (completedTask == task)
            {
                // original task completed
                timeoutCancellation.Cancel();
                return await task;
            }

            // timeout
            return onTimeout != null ? onTimeout() : throw new TimeoutException();
        }
    }
}

Whenever I debug my application and execute this code everything works like a charm. Then I created some tests which at first looked fine, too. But when I ran them some more times, the test became flaky... Sometimes they succeed, sometimes not...

[TestFixture]
public class TaskExtensionsTests
{
    [Repeat(25)]
    [TestCase(10, 100, false)]
    [TestCase(100, 10, true)]
    public async Task WithTimeout(int longRunningTaskInMs, int waitTimeoutInMs, bool expectedWaitCallbackFired)
    {
        var timeOutCallbackFired = false;
        var actual = await Task.Run(async () =>
        {
            Console.WriteLine($"{DateTime.Now:O} long running task started...");
            await Task.Delay(longRunningTaskInMs);
            Console.WriteLine($"{DateTime.Now:O} long running task finished.");
            return "returned in time";
        }).WithTimeout(waitTimeoutInMs, () =>
        {
            Console.WriteLine("timeout!");
            timeOutCallbackFired = true;
            return "fired";
        });

        Assert.That(actual, Is.EqualTo(expectedWaitCallbackFired ? "fired" : "returned in time"));
        Assert.That(timeOutCallbackFired, Is.EqualTo(expectedWaitCallbackFired));
    }
}

You can see, that I am testing two scenarios:

On my development machine which is quite performant I never saw these tests failing. But after I pushed the code and my CI server (on-prem TeamCity) started doing its job I noticed the flakiness. The build agents are way less performent.

I then locally lowered the values from 10/100ms and 100/10ms to 10/25ms and 25/10ms. After doing that the test becomes flaky on my development machine as well. But I think the reason why the tests are now failing is something else. Namely the Windows clock resolution of approx 15.625ms. Therefore, in my opinion, it is not surprising that the tests with these low values become unreliable.

What I can't explain, or rather my real question, is why the tests on the build agents with the 10/100 and 100/10 values fail?! These values are 90ms apart, which is like an eternity...

I created a .NET fiddle to easily reproduce the issue: https://dotnetfiddle.net/pR7tmd If you run the fiddle multiple times the tests will eventually fail. Please note: For the fiddle I used .NET 9 instead of .NET Fx 4.8 and MSTest instead of NUnit. But this seems to be irrelevant since the tests are flaky with these components as well...


EDIT:

In the meantime I tried Stephen's suggestion to use TimeProvider for a more precise timing. This uses a high-performance time stamp based on Stopwatch.

My extension method no looks like this:

public static async Task<T> WithTimeoutWaitAsync<T>(this Task<T> task, int millisecondsTimeout, Func<T> onTimeout = null)
{
    using (var timeoutCancellation = new CancellationTokenSource())
    {
        try
        {
            return await task
                .WaitAsync(TimeSpan.FromMilliseconds(millisecondsTimeout), TimeProvider.System, timeoutCancellation.Token)
                .ConfigureAwait(false);
        }
        catch (TimeoutException timeoutException)
        {
            return onTimeout != null ? onTimeout() : throw timeoutException;
        }
    }
}

I curiously pushed this code and at first it looked very, very promising. Multiple TeamCity builds succeeded. But then... Every now and then single builds failed the same way as before 😭

I created another .NET fiddle for this new code: https://dotnetfiddle.net/8CRadY The error is reproducible there as well if you run the code a couple of times.


Solution

  • Really, your problem is the flaky tests. The code looks OK to me.

    What I can't explain, or rather my real question, is why the tests on the build agents with the 10/100 and 100/10 values fail?! These values are 90ms apart, which is like an eternity...

    I have found that any tests using timers are going to be at least somewhat flaky. In this case, you have a CI server, presumably with other (parallel) builds running concurrently. And it's on-prem, so there's likely antivirus that can just step in and muck things up.

    Bottom line: 100ms sounds like a long time, but in this scenario I'm not surprised the tests can be flaky.

    The best solution is to use something like .NET's new TimeProvider, which has been back-ported to .NET Framework. (Note that the backport also includes a Task.WaitAsync extension method, which is very similar to what your code is doing).

    See this blog post by Andrew Lock on how to using FakeTimeProvider in tests.