performancehaskellparallel-processingthreadscope

Haskell is slower in parallel


I've been experimenting with parallelizing Haskell programs. To my surprise most of my attempts made my examples run slower, so I decided to dig into Threadscope and started following the tutorial here.

In part 5 the example program sudoku3 is introduced, which is suppoed to be an example of a well-parallelized Haskell program. So I compiled it like described in the tutorial with

ghc -O2 sudoku3.hs -threaded -rtsopts -eventlog

and (not like described in the tutorial) measured how fast it is with and without parallelism. The results were surprising:

$ ./sudoku3 sudoku17.1000.txt +RTS -s -N1
   1,181,490,684 bytes allocated in the heap
      13,247,408 bytes copied during GC
       1,094,432 bytes maximum residency (5 sample(s))
          35,556 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2282 colls,     0 par    0.07s    0.08s     0.0000s    0.0061s
  Gen  1         5 colls,     0 par    0.01s    0.01s     0.0012s    0.0044s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 1000 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1000 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.51s  (  2.90s elapsed)
  GC      time    0.08s  (  0.08s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.59s  (  2.98s elapsed)

  Alloc rate    469,886,714 bytes per MUT second

  Productivity  97.0% of total user, 84.3% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

Here the run with two cores:

$ ./sudoku3 sudoku17.1000.txt +RTS -s -N2
   1,207,033,704 bytes allocated in the heap
      23,422,808 bytes copied during GC
       1,066,716 bytes maximum residency (22 sample(s))
          47,524 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1488 colls,  1488 par    0.40s    0.40s     0.0003s    0.0147s
  Gen  1        22 colls,    21 par    0.07s    0.06s     0.0026s    0.0087s

  Parallel GC work balance: 39.57% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 1000 (968 converted, 0 overflowed, 0 dud, 0 GC'd, 32 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    3.45s  (  2.96s elapsed)
  GC      time    0.47s  (  0.45s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    3.93s  (  3.41s elapsed)

  Alloc rate    349,389,354 bytes per MUT second

  Productivity  88.0% of total user, 101.4% of total elapsed

gc_alloc_block_sync: 913
whitehole_spin: 0
gen[0].sync: 25
gen[1].sync: 0

Surprisingly the run with two cores was slower. Why is this?

My GHC version is 7.6.3
Operating System: Debian Jessie Linux/GNU i386 (i686)


Solution

  • I just did the test again and like expected the parallel version is faster.

    It turns out, it is a bad idea to measure performance, while the CPU is 100% busy converting videos.