A common way to draw a rectangle in a bitmap is to loop over the two bounding dimensions and set individual pixels. For example, in pseudocode:
drawRect(array, color, x, X, y, Y):
    for i from x to X:
        for j from y to Y:
            array[i,j] = color
What is the equivalent in Haskell's REPA?
Of the ordinary REPA mechanisms for making new arrays, building a new delayed array is the fastest when the array is copied once to foreign memory. Actual performance with REPA will depend on what you do with the array.
Let's define the type of a computation that depends only on the position in an array and the current value at that position.
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Array.Repa hiding ((++))
import Data.Array.Repa.Repr.ForeignPtr
import Data.Word
import Control.Monad
import Data.Time.Clock
import System.Mem
type Ghost sh a b = sh -> a -> b
We can define filling a box for arrays of any shape.
fill :: Shape sh => sh -> sh -> a -> Ghost sh a a
fill from to color = go
  where
    {-# INLINE go #-}
    go sh a =
        if inShapeRange from to sh
            then color
            else a
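As a quick usage sketch (hypothetical values, not part of the benchmark below): indices inside the box take the new color, indices outside keep their old value.
demoFill :: Bool
demoFill = g (Z :. 5 :. 5) 0 == 255   -- inside the box: painted
        && g (Z :. 0 :. 0) 7 == 7     -- outside the box: unchanged
  where
    g :: Ghost DIM2 Word32 Word32
    g = fill (Z :. 2 :. 2) (Z :. 8 :. 8) 255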
We will use this with three different ways to define a new array - a delayed array, a structured traversal, and an unstructured traversal.
The simplest delayed array is built with fromFunction.
ghostD :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
ghostD g a = fromFunction (extent a) go
  where
    {-# INLINE go #-}
    go sh = g sh (a ! sh)
A structured traversal can take advantage of knowing the structure of the underlying array representation. Unfortunately, the only way we can get information about the position in a structured traversal is by zipping, using szipWith, with an array that somehow already contains the position information.
ghostS :: (Shape sh, Structured r1 a b, Source r1 a) => Ghost sh a b -> Array r1 sh a -> Array (TR r1) sh b
ghostS g a = szipWith ($) ghost a
  where
    ghost = fromFunction (extent a) g
An unstructured traversal is very similar to a delayed array built by fromFunction; it also returns an Array D.
ghostT :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
ghostT g a = traverse a id go
  where
    {-# INLINE go #-}
    go lookup sh = g sh (lookup sh)
With some extremely naive benchmarking, we can run these and see how fast they are. We perform garbage collection before measuring the time to try to get reliable results. There are two benchmarks: for each mechanism, we first run a single step, writing the result to memory, 10 times; then we compose 101 identical steps and write the result to memory once.
bench :: Int -> String -> IO a -> IO ()
bench n name action = do
    performGC
    start <- getCurrentTime
    replicateM_ n action
    performGC
    end <- getCurrentTime
    putStrLn $ name ++ " " ++ show (diffUTCTime end start / fromIntegral n)
iterN :: Int -> (a -> a) -> (a -> a)
iterN 0 f = id
iterN n f = f . iterN (n-1) f
main = do
    (img :: Array F DIM2 Word32) <- computeP (fromFunction (Z :. 1024 :. 1024) (const minBound))
    let (Z :. x :. y) = extent img
        drawBox = fill (Z :. 20 :. 20) (Z :. x - 20 :. y - 20) maxBound
    bench 10 "Delayed 10x1"      ((computeP $ ghostD drawBox img) :: IO (Array F DIM2 Word32))
    bench 10 "Unstructured 10x1" ((computeP $ ghostT drawBox img) :: IO (Array F DIM2 Word32))
    bench 10 "Structured 10x1"   ((computeP $ ghostS drawBox img) :: IO (Array F DIM2 Word32))
    bench 1 "Delayed 1x101"      ((computeP $ iterN 100 (ghostD drawBox) . ghostD drawBox $ img) :: IO (Array F DIM2 Word32))
    bench 1 "Unstructured 1x101" ((computeP $ iterN 100 (ghostT drawBox) . ghostT drawBox $ img) :: IO (Array F DIM2 Word32))
    bench 1 "Structured 1x101"   ((computeP $ iterN 100 (ghostS drawBox) . ghostS drawBox $ img) :: IO (Array F DIM2 Word32))
The resulting times are averaged over the number of times the array was forced by being written to foreign memory. These results are typical of multiple runs on my machine.
Delayed 10x1 0.0234s
Unstructured 10x1 0.02652s
Structured 10x1 0.02652s
Delayed 1x101 0.078s
Unstructured 1x101 0.0936s
Structured 1x101 0.2652s
The results don't seem to depend on the order the benchmarks are run in.
Structured 10x1 0.03276s
Unstructured 10x1 0.02652s
Delayed 10x1 0.01716s
Structured 1x101 0.2184s
Unstructured 1x101 0.1092s
Delayed 1x101 0.0624s
These results suggest you can compose a handful of full-array computations and still have performance dominated by the memory accesses that write out the results.
Libraries for rendering scenes by drawing on them typically have a very different structure from REPA, which is mostly built to do data processing tasks across all the data in parallel. Drawing and rendering libraries usually use a graph or tree of scene elements called a scene graph that allows them to quickly cull elements that won't be drawn in an image or a portion of an image. Mutating the result isn't necessary to achieve good performance if you can quickly cull everything that doesn't affect a specific pixel.
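To make the per-pixel culling idea concrete, here is a minimal sketch (hypothetical names, a flat list of boxes rather than a real scene-graph tree, and not any particular library's API): each pixel consults only the boxes whose bounds contain it.
data Box = Box { boxFrom :: DIM2, boxTo :: DIM2, boxColor :: Word32 }

renderScene :: [Box] -> Array F DIM2 Word32 -> Array D DIM2 Word32
renderScene boxes background = fromFunction (extent background) pixel
  where
    -- Boxes are listed back to front, so a later box paints over earlier ones.
    -- A real scene graph would organize the boxes in a tree so whole subtrees
    -- could be culled at once instead of testing every box at every pixel.
    pixel sh = foldl (\_old b -> boxColor b) (background ! sh)
                     [b | b <- boxes, inShapeRange (boxFrom b) (boxTo b) sh]
Forcing renderScene boxes img with computeP still writes each pixel exactly once, just like the delayed arrays above.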