Tags: kotlin, kotlin-coroutines, kotlin-multiplatform

Explain the reason for a deadlock in Kotlin coroutines


While experimenting with Kotlin coroutines, I ran into a deadlock that I didn't expect.

I reduced the code to the following minimal example that reproduces the issue:

@Test
fun deadlockTest() {
    runBlocking {
        val job = launch {
            runBlocking {
                println("await cancellation")
                awaitCancellation()
            }
        }
        println("launched job")
        delay(100)
        println("waited a bit")
        job.cancelAndJoin()
        println("canceled and joined")
    }
    assertTrue(true)
}

The output is:

launched job
await cancellation
waited a bit

Execution never gets past job.cancelAndJoin(), as if there were a deadlock.

If I change the code slightly to the following:

@Test
fun fixedDeadlockTest() {
    runBlocking {
        val job = launch {
            withContext(Dispatchers.Default) { // <-- this is the only difference
                println("awaiting cancellation")
                awaitCancellation()
            }
        }
        println("launched job")
        delay(100)
        println("waited a bit")
        job.cancelAndJoin()
        println("canceled and joined")
    }
    assertTrue(true)
}

Everything works: all lines are printed and the test completes.

The question is: why does the first version deadlock, and is it bad practice to put a runBlocking inside the launch of another runBlocking? (That is, should runBlocking never be used in your code except at the point where you start a coroutine from non-coroutine code?)

I used the following versions:


Solution

  • Coroutines use so-called structured concurrency to support cancellation, exception handling, etc. Jobs are organized into a tree, so when you create a new coroutine it usually becomes a child of the current coroutine. Parents and children have specific responsibilities towards each other, e.g. cancelling a parent cancels all of its children.
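To make the parent/child relationship concrete, here is a minimal sketch, assuming kotlinx.coroutines is on the classpath. The function name `childIsCancelledWithParent` is hypothetical and only serves the illustration: it cancels a parent job and reports whether the child was cancelled along with it.

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper (not from the original post): reports whether a child
// coroutine ends up cancelled when only its parent is cancelled.
fun childIsCancelledWithParent(): Boolean = runBlocking {
    var child: Job? = null
    val parent = launch {
        // Started from inside the parent, so it becomes the parent's child.
        child = launch { awaitCancellation() }
        awaitCancellation()
    }
    delay(50)               // give the parent time to start and launch its child
    parent.cancelAndJoin()  // cancels the parent; the cancellation propagates to the child
    child?.isCancelled == true
}
```

Calling `childIsCancelledWithParent()` from a `main` function should return `true`, because the inner `launch` inherits its Job from the outer one.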

    However, there are ways to start coroutines that are not attached to the current coroutine. This happens, for example, if you provide a specific CoroutineScope or if you use runBlocking(). Note that, unlike launch() or async(), runBlocking() does not need to be called from a coroutine. It was designed mainly to bridge non-coroutine and coroutine code, so it creates its coroutine as a new "root", detached from any other coroutine.

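    For this reason, cancelAndJoin() in your example cancels the coroutine started by launch(), but it does not cancel the coroutine running inside the inner runBlocking(). The "awaitCancellation" coroutine is detached from the "launch" coroutine, so it never sees the cancellation. The thread stays blocked inside the inner runBlocking(), the "launch" coroutine therefore never completes, and the join never returns.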
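One way to check the detachment described above is to compare Job identities. This is a rough sketch, assuming kotlinx.coroutines is available; `innerJobIsChild` is a hypothetical helper name, and `coroutineScope` is used here only as a contrasting suspending builder. It asks the Job of the `launch` coroutine whether the inner coroutine's Job appears among its children.

```kotlin
import kotlinx.coroutines.*

// Hypothetical helper: returns true if the inner coroutine's Job is a child
// of the Job created by launch().
fun innerJobIsChild(useRunBlocking: Boolean): Boolean = runBlocking {
    var isChild = false
    launch {
        val launchJob = coroutineContext[Job]!!
        if (useRunBlocking) {
            // Nested runBlocking: starts a new root coroutine.
            runBlocking {
                val innerJob = coroutineContext[Job]
                isChild = launchJob.children.any { it === innerJob }
            }
        } else {
            // coroutineScope: suspends and attaches to the current Job.
            coroutineScope {
                val innerJob = coroutineContext[Job]
                isChild = launchJob.children.any { it === innerJob }
            }
        }
    }.join()
    isChild
}
```

With this sketch, `innerJobIsChild(useRunBlocking = true)` should return `false` and `innerJobIsChild(useRunBlocking = false)` should return `true`, which matches the explanation that cancelling the `launch` job has nothing to propagate to inside a nested `runBlocking()`.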

    Below is my original answer. I was wrong about the main cause of this deadlock, but what I said is still mostly correct and complements the answer above, so I am leaving it as is. I was wrong because I forgot that runBlocking() internally uses a thread-local variable to store its event loop. That means a runBlocking() running inside another runBlocking() actually shares the same event loop / dispatcher, so when the inner coroutine suspends at awaitCancellation(), the shared event loop can still resume the outer coroutine from delay(). Still, I think nesting runBlocking() like this is discouraged.

    Original answer:

    The deadlock happens because the outer runBlocking() starts a single-threaded coroutine dispatcher to run the coroutines launched inside it, and the inner runBlocking() blocks that one and only thread.

    You are correct: runBlocking() is mostly intended for bridging non-coroutine and coroutine code. There is no hard rule that forbids using runBlocking() inside coroutines, but in general we should avoid blocking inside a coroutine and suspend instead. runBlocking() blocks, so it is discouraged there and may lead to problems like the one above.
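As an illustration of "suspend instead of block", here is a sketch of the example from the question with the inner runBlocking() replaced by the suspending `coroutineScope` builder. This is one possible replacement under the assumption that no dispatcher switch is needed, not the only fix; `runSuspendingVariant` and the `events` list are hypothetical names added so the order of steps can be inspected.

```kotlin
import kotlinx.coroutines.*

// Hypothetical variant of the question's test: the inner block suspends
// (coroutineScope) instead of blocking (runBlocking), so it stays a child
// of the launch job and receives the cancellation.
fun runSuspendingVariant(): List<String> {
    val events = mutableListOf<String>()
    runBlocking {
        val job = launch {
            coroutineScope {
                events += "await cancellation"
                awaitCancellation()
            }
        }
        events += "launched job"
        delay(100)
        events += "waited a bit"
        job.cancelAndJoin()   // now completes: the cancellation reaches awaitCancellation()
        events += "canceled and joined"
    }
    return events
}
```

In this variant the returned list should end with "canceled and joined", the line that was never reached in the deadlocking version.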