performance, kotlin, parallel-processing, kotlin-coroutines

Coroutines speed up execution by more than the number of logical processors. Why?


I'm playing around with Kotlin coroutines, and now I'm testing them on the Nilakantha series (a formula for calculating Pi).

Here is my code:

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis
import kotlin.Long.Companion.MAX_VALUE

const val MEASURE_COUNT = 10
const val ITERATIONS = 1_000_000_000

fun main() {

    val sequentialTime = measureTimeMultiple(ITERATIONS, ::calculatePiNilakanthaSequential, "calculatePiNilakanthaSequential")
    val parallelTime = measureTimeMultiple(ITERATIONS, ::calculatePiNilakanthaParallel, "calculatePiNilakanthaParallel")

    println("Parallelization Speedup: ${sequentialTime.toDouble() / parallelTime}")
}

fun calculatePiNilakanthaSequential(iterations: Int): Double {
    var pi = 3.0
    var sign = 1.0

    for (n in 1..iterations) {
        val numerator = 4.0 * sign
        val denominator = (2.0 * n) * (2.0 * n + 1) * (2.0 * n + 2)
        pi += numerator / denominator
        sign *= -1.0
    }

    return pi
}

fun calculatePiNilakanthaParallel(iterations: Int): Double {
    val cores = Runtime.getRuntime().availableProcessors()
    val chunkSize = iterations / cores

    val results = runBlocking {
        (0 until cores).map { core ->
            async(Dispatchers.Default) {
                // Each coroutine sums its own chunk of the series; the last chunk absorbs any remainder.
                val start = core * chunkSize + 1
                val end = if (core == cores - 1) iterations else (core + 1) * chunkSize
                var partialPi = 0.0
                // The series alternates sign starting with + at n = 1, so odd start => +1.0, even start => -1.0.
                var sign = if (start % 2 == 0) -1.0 else 1.0

                for (n in start..end) {
                    val numerator = 4.0 * sign
                    val denominator = (2.0 * n) * (2.0 * n + 1) * (2.0 * n + 2)
                    partialPi += numerator / denominator
                    sign *= -1.0
                }
                partialPi
            }
        }.awaitAll()
    }

    return 3.0 + results.sum()
}

fun measureTimeMultiple(iterations: Int, function: (Int) -> Any, functionName: String): Long {
    var minTime = MAX_VALUE

    for (i in 1..MEASURE_COUNT) {
        val time = measureTimeMillis {
            function(iterations)
        }
        if (time < minTime) minTime = time
    }

    println("$functionName() took minimum $minTime ms")
    return minTime
}

It gives the following output:

calculatePiNilakanthaSequential() took minimum 4499 ms
calculatePiNilakanthaParallel() took minimum 1003 ms
Parallelization Speedup: 4.485543369890329

My processor is an Intel Core i3-1115G4. It has 2 physical and 4 logical cores. The JDK version is 8.

According to Amdahl's law, the speedup cannot exceed the number of logical processors, yet every time I run this I get more than a 4x speedup.
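
Just to spell out the bound I have in mind: with a parallel fraction p of the work spread over N workers, Amdahl's law gives a speedup of 1 / ((1 - p) + p / N), which can never exceed N. A small sanity-check snippet (amdahlSpeedup is just a helper I wrote for this post, not part of the benchmark above):

fun amdahlSpeedup(parallelFraction: Double, workers: Int): Double =
    1.0 / ((1.0 - parallelFraction) + parallelFraction / workers)

// Even a perfectly parallel workload is capped at the worker count:
// amdahlSpeedup(1.0, 4)  == 4.0   (the ceiling on a 4-thread CPU)
// amdahlSpeedup(0.95, 4) ~= 3.48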

Why?


Edit: I thought that maybe Kotlin coroutines are somehow more optimized than just plain execution, so I tried creating one coroutine and doing all the work there. As expected, it didn't speed anything up, so the answer is probably not in the coroutines themselves, but in work partitioning/pipelining/cache? I have no idea.
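
For reference, the single-coroutine variant I tried looked roughly like this (a sketch; the exact code may have differed slightly):

fun calculatePiNilakanthaSingleCoroutine(iterations: Int): Double = runBlocking {
    // All the work happens in a single coroutine, so there is no actual parallelism;
    // this only adds the overhead of launching one coroutine on Dispatchers.Default.
    val result = async(Dispatchers.Default) {
        var pi = 3.0
        var sign = 1.0
        for (n in 1..iterations) {
            val numerator = 4.0 * sign
            val denominator = (2.0 * n) * (2.0 * n + 1) * (2.0 * n + 2)
            pi += numerator / denominator
            sign *= -1.0
        }
        pi
    }
    result.await()
}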


Solution

  • I'm fairly sure that this is due to the JIT compiler analyzing each method differently and making different decisions about how to compile it. Here is how I came to that conclusion.

    Using Java 21, on my i7 (8-core) machine, I get a parallelization speedup of just over 8. Sometimes as high as 9.

    Running without JIT compilation (by setting the environment variable JAVA_OPTS=-Xint), the application runs much slower and gives a parallelization speedup of only about 2.5, much less than the number of threads used by the coroutines, presumably due to the overhead of managing threads/coroutines in interpreted bytecode.
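
    If you want to confirm from inside the program whether the JIT is active at all (for example when experimenting with -Xint), you can query the standard CompilationMXBean. A small sketch (printJitInfo is just a helper name I made up):

    import java.lang.management.ManagementFactory

    fun printJitInfo() {
        // getCompilationMXBean() returns null when the JVM has no compilation system,
        // which is typically the case when running fully interpreted with -Xint.
        val jit = ManagementFactory.getCompilationMXBean()
        if (jit == null) {
            println("No JIT compiler available (interpreted mode)")
        } else {
            val time = if (jit.isCompilationTimeMonitoringSupported)
                "${jit.totalCompilationTime} ms"
            else
                "not supported"
            println("JIT compiler: ${jit.name}, total compilation time so far: $time")
        }
    }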

    Now, to make sure that both the sequential and parallel methods run the exact same calculation method in both cases, I changed your code to the following:

    fun calculatePiNilakanthaSequential(iterations: Int) =
        3.0 + calculatePiNilakanthaPartial(1, iterations)
    
    fun calculatePiNilakanthaParallel(iterations: Int): Double {
        val cores = Runtime.getRuntime().availableProcessors()
        val chunkSize = iterations / cores
    
        val results = runBlocking {
            (0 until cores).map { core ->
                async(Dispatchers.Default) {
                    val start = core * chunkSize + 1
                    val end = if (core == cores - 1) iterations else (core + 1) * chunkSize
                    calculatePiNilakanthaPartial(start, end)
                }
            }.awaitAll()
        }
    
        return 3.0 + results.sum()
    }
    
    fun calculatePiNilakanthaPartial(start: Int, end: Int): Double {
        var pi = 0.0
        var sign = if (start % 2 == 0) -1.0 else 1.0
    
        for (n in start..end) {
            val numerator = 4.0 * sign
            val denominator = (2.0 * n) * (2.0 * n + 1) * (2.0 * n + 2)
            pi += numerator / denominator
            sign *= -1.0
        }
    
        return pi
    }
    

    This way, the method calculatePiNilakanthaPartial is the one that takes the most time, and it's the same method that is called by both calculatePiNilakanthaSequential and calculatePiNilakanthaParallel. If that method gets JIT-compiled, the same compiled code gets run by both parallel and sequential methods.

    The result? A parallelization speedup of about 6. A decent speedup given 4 physical and 8 logical cores, but not more than 8.

    So, to conclude, it seems that the JIT compiler (in JVM versions 22 or earlier) compiled the calculatePiNilakanthaSequential method suboptimally for some reason. In Java 23, the JIT compiler does a much better job for the sequential method, which runs about 3 times faster, and hence the parallelization speedup is much lower.
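
    As an aside, one way to reduce the JIT's influence on the measurement itself is to warm each function up before timing it, so that the timed runs compare compiled code against compiled code. A rough sketch of such a variant of your measureTimeMultiple, assuming it lives in the same file as the rest of your code (the warm-up count is arbitrary):

    fun measureTimeWarmedUp(
        iterations: Int,
        function: (Int) -> Any,
        functionName: String,
        warmupRuns: Int = 3
    ): Long {
        // Untimed warm-up runs give the JIT a chance to compile the hot loop before measuring.
        repeat(warmupRuns) { function(iterations) }

        var minTime = Long.MAX_VALUE
        repeat(MEASURE_COUNT) {
            val time = measureTimeMillis { function(iterations) }
            if (time < minTime) minTime = time
        }

        println("$functionName() took minimum $minTime ms after $warmupRuns warm-up runs")
        return minTime
    }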