I wrote a Mandelbrot set generator in Haskell using the HIP library. Now I am trying to parallelize it. Unfortunately running following code on single then six cores seems to have no effect on performance.
Main.hs
module Main where
import Data.Complex
import Graphics.Image (RPU (RPU), writeImage, makeImageR, Pixel(PixelRGB), Image)
import Graphics.Image.ColorSpace (RGB)
import Debug.Trace (trace)
main :: IO ()
main = do
let image = makeImageR RPU (height, width) mandelbrotGenerator
writeImage "target.jpg" image
where mandelbrotGenerator :: (Int, Int) -> Pixel RGB Double
mandelbrotGenerator (x, y) = let reStart = -2
reEnd = 2
imStart = -2
imEnd = 2
x' = fromIntegral x
y' = fromIntegral y
width' = fromIntegral width
height' = fromIntegral height
c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
in plotd (mandelbrot c 80)
where
plotd r | r < 2 = PixelRGB 255 0 0
| otherwise = PixelRGB 0 0 255
height = 10000
width = 10000
mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter
I thought that in case of hip using RPU parameter and ghc-flags: -O2 -threaded
would be enough.
Here is code execution time with +RTS -s -N6
flags:
1,378,695,196,456 bytes allocated in the heap
486,041,624 bytes copied during GC
4,800,312,192 bytes maximum residency (4 sample(s))
412,800 bytes maximum slop
6896 MiB total memory in use (1 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 87335 colls, 87335 par 34.261s 23.647s 0.0003s 0.0052s
Gen 1 4 colls, 3 par 0.016s 0.015s 0.0037s 0.0127s
Parallel GC work balance: 46.93% (serial 0%, perfect 100%)
TASKS: 14 (1 bound, 13 peak workers (13 total), using -N6)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.003s ( 0.002s elapsed)
MUT time 942.718s (210.764s elapsed)
GC time 34.277s ( 23.662s elapsed)
EXIT time 0.001s ( 0.002s elapsed)
Total time 976.999s (234.430s elapsed)
Alloc rate 1,462,468,617 bytes per MUT second
Productivity 96.5% of total user, 89.9% of total elapsed
and here with only +RTS -s
:
1,378,694,953,992 bytes allocated in the heap
415,624,768 bytes copied during GC
4,800,096,768 bytes maximum residency (4 sample(s))
366,080 bytes maximum slop
6873 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 328664 colls, 0 par 1.885s 2.125s 0.0000s 0.0005s
Gen 1 4 colls, 0 par 0.013s 0.013s 0.0033s 0.0123s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.000s elapsed)
MUT time 180.651s (180.173s elapsed)
GC time 1.898s ( 2.138s elapsed)
EXIT time 0.000s ( 0.009s elapsed)
Total time 182.550s (182.321s elapsed)
Alloc rate 7,631,819,383 bytes per MUT second
Productivity 99.0% of total user, 98.8% of total elapsed
EDIT: I noticed this is repa-only related issue. I changed code to following:
main :: IO ()
main = do
image <- (computeP $ fromFunction (Z:.width:.height) mandelbrotGenerator :: IO (Array U DIM2 (Pixel RGB Double)))
putStrLn "DONE"
-- writeImage "target.jpg" $ fromRepaArrayS image
where mandelbrotGenerator :: DIM2 -> Pixel RGB Double
mandelbrotGenerator (Z:.x:.y) = let reStart = -2
reEnd = 2
imStart = -2
imEnd = 2
x' = fromIntegral x
y' = fromIntegral y
width' = fromIntegral width
height' = fromIntegral height
c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
in plotd (mandelbrot c 80)
where
plotd r | r < 2 = PixelRGB 255 0 0
| otherwise = PixelRGB 0 0 255
height = 10000
width = 10000
mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter
and execution times are very similar on both one-core and six-core setup.
I can't replicate your problem.
On my 8-core, 16-thread laptop (with an Intel i9-9980HK), the runtime of your REPA version compiled under Stack LTS-20.20 (GHC 9.2.7 with flags -O2 -threaded
) is:
+RTS flags | elapsed time (sec) |
---|---|
-s -N1 | 121 |
-s -N2 | 71 |
-s -N3 | 52 |
-s -N4 | 44 |
-s -N6 | 46 |
-s -N8 | 63 |
-s -N16 | 75 |
so while there are some diminishing returns, there are definite gains when using 2-4 cores.
Maybe post the statistics output for the REPA version, try a few different -N
values, and consider building with a newer version of GHC and the various packages to see if that makes a difference.