Tags: haskell, concurrency, benchmarking, io-monad, haskell-criterion

Asynchronous code runs slower than the synchronous version in Haskell


Benchmarking the following:

#!/usr/bin/env stack
-- stack --resolver lts-16.2 script --package async --package criterion

import           Control.Concurrent.Async (async, replicateConcurrently_)
import           Control.Monad            (replicateM_, void)
import           Criterion.Main

main :: IO ()
main = defaultMain [
    bgroup "tests" [ bench "sync" $ nfIO syncTest
                   , bench "async" $ nfIO asyncTest
                   ]
    ]

syncTest :: IO ()
syncTest = replicateM_ 100000 dummy

asyncTest :: IO ()
asyncTest = replicateConcurrently_ 100000 dummy

dummy :: IO Int
dummy = return $ fib 10000000000

fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

Gives me this:

% ./applicative-v-monad.hs
benchmarking tests/sync
time                 2.120 ms   (2.075 ms .. 2.160 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 2.040 ms   (2.023 ms .. 2.073 ms)
std dev              77.37 μs   (54.96 μs .. 122.8 μs)
variance introduced by outliers: 23% (moderately inflated)

benchmarking tests/async
time                 475.3 ms   (310.7 ms .. 642.8 ms)
                     0.984 R²   (0.943 R² .. 1.000 R²)
mean                 527.2 ms   (497.9 ms .. 570.9 ms)
std dev              41.30 ms   (4.833 ms .. 52.83 ms)
variance introduced by outliers: 21% (moderately inflated)

It is apparent that asyncTest takes far longer than syncTest.

I would have thought that running expensive actions concurrently will be faster than running them in sequence. Is there some flaw in my reasoning?


Solution

  • There are a few problems with this benchmark.

    First of all: laziness

    As @David Fletcher pointed out, you are not forcing the computation of fib. Normally the fix would be as simple as:

    dummy :: IO Int
    dummy = return $! fib 10000000000
    

    Which is enough to make us wait for eternity. Lowering it to something more manageable is the next thing we should do:

    dummy :: IO Int
    dummy = return $! fib 35
    
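As an aside, the difference `$!` makes can be observed with a tiny base-only experiment (the names here are illustrative; `evaluate` is morally what `return $!` does, forcing to WHNF):

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- Stands in for an expensive computation; evaluating it throws,
-- which lets us observe whether it was ever forced.
boom :: Int
boom = error "expensive thunk"

-- 'return' does not evaluate its argument: the thunk stays latent.
lazyCase :: IO (Either ErrorCall Int)
lazyCase = try (return boom >>= \_ -> return 0)

-- 'evaluate' forces to WHNF, so the error surfaces immediately.
strictCase :: IO (Either ErrorCall Int)
strictCase = try (evaluate boom)

main :: IO ()
main = do
  lazyCase   >>= print  -- Right 0: the "expensive" work never happened
  strictCase >>= print
```

This is exactly why the unforced benchmark measured almost nothing: criterion was timing the construction of a thunk, not the Fibonacci computation.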

    This would normally be enough, but GHC is too smart: it sees that the computation is pure, optimizes the loop of 100000 iterations into a single computation, and returns the same result 100000 times, so in reality fib is computed only once. Instead, let's make fib depend on the iteration number:

    xs :: [Int]
    xs = [1..35]
    
    syncTest :: IO ()
    syncTest = mapM_ dummy xs
    
    asyncTest :: IO ()
    asyncTest = mapConcurrently_ dummy xs  -- import Control.Concurrent.Async (mapConcurrently_)
    
    dummy :: Int -> IO Int
    dummy n = return $! fib n
    
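As a quick base-only sanity check (no criterion or async needed), the reworked dummy can be exercised directly with the same naive fib from the question; summing the results forces every element to actually be computed:

```haskell
import Control.Monad (foldM)

-- The naive Fibonacci from the question (exponential time).
fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Reworked dummy: the work depends on the argument, so GHC cannot
-- hoist a single shared result out of the loop.
dummy :: Int -> IO Int
dummy n = return $! fib n

main :: IO ()
main = do
  -- Sum the results so every element is demonstrably computed.
  total <- foldM (\acc n -> (acc +) <$> dummy n) 0 [1 .. 20]
  print total  -- 28655
```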

    The next problem is compilation

    stack script runs the code interpreted and without the threaded runtime, so your code will run slowly and sequentially. We fix that by compiling manually with some flags:

    $ stack exec --resolver lts-16.2 --package async --package criterion -- ghc -threaded -O2 -rtsopts -with-rtsopts=-N bench-async.hs
    $ stack exec --resolver lts-16.2 -- ./bench-async
    

    Of course, in a full-blown stack project all these flags go into the cabal file instead, and running stack bench does the rest.
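For reference, a benchmark stanza carrying these flags in a cabal file might look roughly like this (the target name, file name, and version bounds are illustrative):

```cabal
benchmark bench-async
  type:           exitcode-stdio-1.0
  main-is:        bench-async.hs
  build-depends:  base, async, criterion, scheduler
  ghc-options:    -threaded -O2 -rtsopts -with-rtsopts=-N
```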

    Last but not least: too many threads.

    In the question you have asyncTest = replicateConcurrently_ 100000 dummy. Unless the number of iterations is very low, which it is not, you don't want to use async for this: spawning 100000 threads is not free. A work-stealing scheduler is a better fit for this kind of workload, and I wrote a library specifically for this purpose: scheduler

    Here is an example of how to use it:

    import qualified Control.Scheduler as S
    
    main :: IO ()
    main = defaultMain [
        bgroup "tests" [ bench "sync" $ whnfIO syncTest
                       , bench "async" $ nfIO asyncTest
                       , bench "scheduler" $ nfIO schedulerTest
                       ]
        ]
    
    schedulerTest :: IO ()
    schedulerTest = S.traverseConcurrently_ S.Par dummy xs
    

    Now this will give us more sensible numbers:

    benchmarking tests/sync
    time                 246.7 ms   (210.6 ms .. 269.0 ms)
                         0.989 R²   (0.951 R² .. 1.000 R²)
    mean                 266.4 ms   (256.4 ms .. 286.0 ms)
    std dev              21.60 ms   (457.3 μs .. 26.92 ms)
    variance introduced by outliers: 18% (moderately inflated)
    
    benchmarking tests/async
    time                 135.4 ms   (127.8 ms .. 147.9 ms)
                         0.992 R²   (0.980 R² .. 1.000 R²)
    mean                 134.8 ms   (129.7 ms .. 138.0 ms)
    std dev              6.578 ms   (3.605 ms .. 9.807 ms)
    variance introduced by outliers: 11% (moderately inflated)
    
    benchmarking tests/scheduler
    time                 109.0 ms   (96.83 ms .. 120.3 ms)
                         0.989 R²   (0.956 R² .. 1.000 R²)
    mean                 111.5 ms   (108.0 ms .. 120.2 ms)
    std dev              7.574 ms   (2.496 ms .. 11.85 ms)
    variance introduced by outliers: 12% (moderately inflated)