haskellparallel-processingrepa

How to enable parallelism in the HIP library


I wrote a Mandelbrot set generator in Haskell using the HIP library. Now I am trying to parallelize it. Unfortunately running following code on single then six cores seems to have no effect on performance.

Main.hs

module Main where
import Data.Complex
import Graphics.Image (RPU (RPU), writeImage, makeImageR, Pixel(PixelRGB), Image)
import Graphics.Image.ColorSpace (RGB)
import Debug.Trace (trace)

main :: IO ()
main = do
    let image = makeImageR RPU (height, width) mandelbrotGenerator
    writeImage  "target.jpg" image
    where mandelbrotGenerator :: (Int, Int) -> Pixel RGB Double
          mandelbrotGenerator (x, y) = let reStart = -2
                                           reEnd   =  2
                                           imStart = -2
                                           imEnd   =  2
                                           x' = fromIntegral x
                                           y' = fromIntegral y
                                           width' = fromIntegral width
                                           height' = fromIntegral height
                                           c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
                                        in plotd (mandelbrot c 80)
                                     where 
                                        plotd r | r < 2 = PixelRGB 255 0 0
                                                | otherwise = PixelRGB 0 0 255
          height = 10000
          width  = 10000
          mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter

I thought that in case of hip using RPU parameter and ghc-flags: -O2 -threaded would be enough.

Here is code execution time with +RTS -s -N6 flags:

1,378,695,196,456 bytes allocated in the heap
     486,041,624 bytes copied during GC
   4,800,312,192 bytes maximum residency (4 sample(s))
         412,800 bytes maximum slop
            6896 MiB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     87335 colls, 87335 par   34.261s  23.647s     0.0003s    0.0052s
  Gen  1         4 colls,     3 par    0.016s   0.015s     0.0037s    0.0127s

  Parallel GC work balance: 46.93% (serial 0%, perfect 100%)

  TASKS: 14 (1 bound, 13 peak workers (13 total), using -N6)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.002s elapsed)
  MUT     time  942.718s  (210.764s elapsed)
  GC      time   34.277s  ( 23.662s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time  976.999s  (234.430s elapsed)

  Alloc rate    1,462,468,617 bytes per MUT second

  Productivity  96.5% of total user, 89.9% of total elapsed

and here with only +RTS -s:

1,378,694,953,992 bytes allocated in the heap
     415,624,768 bytes copied during GC
   4,800,096,768 bytes maximum residency (4 sample(s))
         366,080 bytes maximum slop
            6873 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     328664 colls,     0 par    1.885s   2.125s     0.0000s    0.0005s
  Gen  1         4 colls,     0 par    0.013s   0.013s     0.0033s    0.0123s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time  180.651s  (180.173s elapsed)
  GC      time    1.898s  (  2.138s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time  182.550s  (182.321s elapsed)

  Alloc rate    7,631,819,383 bytes per MUT second

  Productivity  99.0% of total user, 98.8% of total elapsed

EDIT: I noticed this is repa-only related issue. I changed code to following:

main :: IO ()
main = do
   image <- (computeP $ fromFunction (Z:.width:.height) mandelbrotGenerator :: IO (Array U DIM2 (Pixel RGB Double)))
   putStrLn "DONE"
--   writeImage  "target.jpg" $ fromRepaArrayS image
       where mandelbrotGenerator :: DIM2 -> Pixel RGB Double
             mandelbrotGenerator (Z:.x:.y) = let reStart = -2
                                                 reEnd   =  2
                                                 imStart = -2
                                                 imEnd   =  2
                                                 x' = fromIntegral x
                                                 y' = fromIntegral y
                                                 width' = fromIntegral width
                                                 height' = fromIntegral height
                                                 c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
                                              in plotd (mandelbrot c 80)
                                           where 
                                           plotd r | r < 2 = PixelRGB 255 0 0
                                                   | otherwise = PixelRGB 0 0 255
             height = 10000
             width  = 10000
             mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter

and execution times are very similar on both one-core and six-core setup.


Solution

  • I can't replicate your problem. On my 8-core, 16-thread laptop (with an Intel i9-9980HK), the runtime of your REPA version compiled under Stack LTS-20.20 (GHC 9.2.7 with flags -O2 -threaded) is:

    +RTS flags elapsed time (sec)
    -s -N1 121
    -s -N2 71
    -s -N3 52
    -s -N4 44
    -s -N6 46
    -s -N8 63
    -s -N16 75

    so while there are some diminishing returns, there are definite gains when using 2-4 cores.

    Maybe post the statistics output for the REPA version, try a few different -N values, and consider building with a newer version of GHC and the various packages to see if that makes a difference.