haskellrepa

What is the fastest way to draw a rectangle in a REPA array?


Usually, a good method of drawing a rectangle in a bitmap is looping through the 2 bounding dimensions and setting individual pixels. For example, on pseudocode:

drawRect(array, color, x, X, y, Y):
    for x from x til X:
        for y from y til Y:
            array[x,y] = color

What is the equivalent on Haskell's REPA?


Solution

  • Of the ordinary REPA mechanisms for making new arrays, making an new delayed array is fastest when copying an array once to foreign memory. Actual performance using REPA will depend on what you do with your array.

    Let's define the type of a computation that depends only on the position in an array and the current value at that position.

    {-# LANGUAGE ScopedTypeVariables #-}
    
    import Data.Array.Repa hiding ((++))
    import Data.Array.Repa.Repr.ForeignPtr
    
    import Data.Word
    import Control.Monad
    import Data.Time.Clock
    import System.Mem
    
    type Ghost sh a b = sh -> a -> b
    

    We can define filling in any shape.

    fill :: Shape sh => sh -> sh -> a -> Ghost sh a a
    fill from to color = go
        where
            {-# INLINE go #-}
            go sh a =
                if inShapeRange from to sh
                then color
                else a
    

    We will use this with three different ways to define a new array - a delayed array, a structured traversal, and an unstructured traversal.

    The simplest delaying is fromFunction.

    ghostD :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
    ghostD g a = fromFunction (extent a) go
        where
            {-# INLINE go #-}
            go sh = g sh (a ! sh)
    

    A structured traversal can take advantage of knowing the structure of the underlying array representation. Unfortunately, the only way we can get information about the position in a structured traversal is by zipping using szipWith with an array that somehow already contains the position information.

    ghostS :: (Shape sh, Structured r1 a b, Source r1 a) => Ghost sh a b -> Array r1 sh a -> Array (TR r1) sh b
    ghostS g a = szipWith ($) ghost a
        where
            ghost = fromFunction (extent a) g
    

    An unstructured traversal is very similar to a delayed array built by fromFunction; it also returns an Array D.

    ghostT :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
    ghostT g a = traverse a id go
        where
            {-# INLINE go #-}
            go lookup sh = g sh (lookup sh)
    

    With some extremely naive benchmarking, we can run these and see how fast they are. We perform garbage collection before measuring the time to try to get reliable timing results. We'll have two benchmarks. For each mechanism,. we'll run a single step writing the result to memory 10 times. Then we'll compose 101 identical steps writing the result to memory once.

    bench :: Int -> String -> IO a -> IO ()
    bench n name action = do
        performGC
        start <- getCurrentTime
        replicateM_ n action    
        performGC
        end <- getCurrentTime
        putStrLn $ name ++ " " ++ (show (diffUTCTime end start / fromIntegral n))
    
    iterN :: Int -> (a -> a) -> (a -> a)
    iterN 0 f = id
    iterN n f = f . iterN (n-1) f
    
    main = do
        (img :: Array F DIM2 Word32) <- computeP (fromFunction (Z :. 1024 :. 1024 ) (const minBound))
        let (Z :. x :. y ) = extent img
            drawBox = fill (Z :. 20 :. 20 ) (Z :. x - 20 :. y - 20 ) maxBound
    
        bench 10 "Delayed      10x1" ((computeP $ ghostD drawBox img) :: IO (Array F DIM2 Word32))
        bench 10 "Unstructured 10x1" ((computeP $ ghostT drawBox img) :: IO (Array F DIM2 Word32))
        bench 10 "Structured   10x1" ((computeP $ ghostS drawBox img) :: IO (Array F DIM2 Word32))
    
        bench 1 "Delayed      1x101" ((computeP $ (iterN 100 (ghostD drawBox)) . ghostD drawBox $ img) :: IO (Array F DIM2 Word32))
        bench 1 "Unstructured 1x101" ((computeP $ (iterN 100 (ghostT drawBox)) . ghostT drawBox $ img) :: IO (Array F DIM2 Word32))
        bench 1 "Structured   1x101" ((computeP $ (iterN 100 (ghostS drawBox)) . ghostS drawBox $ img) :: IO (Array F DIM2 Word32))
    

    The resulting times are averaged over the number of times the array was forced by being written to foreign memory. These results are typical of multiple runs on my machine.

    Delayed      10x1 0.0234s
    Unstructured 10x1 0.02652s
    Structured   10x1 0.02652s
    Delayed      1x101 0.078s
    Unstructured 1x101 0.0936s
    Structured   1x101 0.2652s
    

    The results don't seem to depend on the order the benchmarks are run in.

    Structured   10x1 0.03276s
    Unstructured 10x1 0.02652s
    Delayed      10x1 0.01716s
    Structured   1x101 0.2184s
    Unstructured 1x101 0.1092s
    Delayed      1x101 0.0624s
    

    These results suggest you can do a handful of full-array computations and still have the performance be dominated by memory access to write the results.

    Libraries for rendering scenes by drawing on them typically have a very different structure from REPA, which is mostly built to do data processing tasks across all the data in parallel. Drawing and rendering libraries usually use a graph or tree of scene elements called a scene graph that allows them to quickly cull elements that won't be drawn in an image or a portion of an image. Mutating the result isn't necessary to achieve good performance if you can quickly cull everything that doesn't affect a specific pixel.