I've written a program that analyzes sample data contained in a file. Currently, my program reads the samples into a list, and I perform the further analysis and processing on that list of samples ([Float]).
I'm not quite happy with the performance, so I'm thinking about using arrays instead of lists. I'm also looking into parallelizing my implementation, and Data.Array.Repa looks promising.
Currently, reading from the file goes something like this:
1. Read the file into a ByteString, using hGet.
2. Group the ByteString into a list of ByteStrings of 3 bytes each.
3. Map a toFloat function over the list of ByteStrings to get a list of Floats.
This results in the [Float] that I analyze to get the desired information.
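To make the steps above concrete, here is a minimal sketch of the list-based pipeline. The sample encoding is an assumption made only for illustration (signed 24-bit little-endian integers scaled to [-1, 1)), the names chunksOf3, samplesFromBytes and readSamples are hypothetical, and BS.readFile stands in for the hGet call:

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import Data.Int (Int32)

-- ASSUMPTION: each sample is a signed 24-bit little-endian integer,
-- scaled to [-1, 1). The real encoding is not stated in the question.
toFloat :: BS.ByteString -> Float
toFloat chunk = fromIntegral signed / 8388608
  where
    byte i = fromIntegral (BS.index chunk i) :: Int32
    raw    = byte 0 .|. (byte 1 `shiftL` 8) .|. (byte 2 `shiftL` 16)
    -- Sign-extend from 24 bits.
    signed = if raw >= 0x800000 then raw - 0x1000000 else raw

-- Step 2: split into 3-byte chunks; a trailing partial chunk is dropped.
chunksOf3 :: BS.ByteString -> [BS.ByteString]
chunksOf3 bs
  | BS.length bs < 3 = []
  | otherwise        = let (h, t) = BS.splitAt 3 bs in h : chunksOf3 t

-- Step 3: convert every chunk, producing the intermediate [Float].
samplesFromBytes :: BS.ByteString -> [Float]
samplesFromBytes = map toFloat . chunksOf3

-- Step 1 followed by steps 2 and 3.
readSamples :: FilePath -> IO [Float]
readSamples path = samplesFromBytes <$> BS.readFile path
```

Each sample goes through a freshly allocated list cell and a small ByteString slice before it becomes a Float, which is the overhead I would like to avoid.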
I'm wondering at which step in this process I should start using an array. I first thought about using the listArray function to transform my [Float] into an array of floats, but I suspect that this isn't the most efficient way.
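For reference, the listArray approach would look roughly like the sketch below. The choice of an unboxed UArray with 0-based Int indices is an assumption, and toSampleArray is a hypothetical name:

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))

-- Copy an existing list of samples into an unboxed, 0-indexed array.
-- Computing the upper bound needs the list length, so the whole list
-- is traversed (and fully materialized) before the array is filled.
toSampleArray :: [Float] -> UArray Int Float
toSampleArray xs = listArray (0, length xs - 1) xs
```

For example, toSampleArray [0.5, -0.5, 0.25] ! 1 evaluates to -0.5. The intermediate list still has to be built completely, which is why this feels wasteful.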
Would it be possible to use Data.Array.Repa.fromFunction to construct the array right after step 2 and skip the intermediate [Float]? For the function, could I use something like (map toFloat bsList), where bsList is the list of ByteStrings after grouping?
Or is there a way to read the samples directly into an array?
Repa can actually use a ByteString as the backing representation of an array, so you can start processing the ByteString in parallel right off the bat by trying something along these lines:
#!/usr/bin/env stack
-- stack runghc --package repa
import Data.ByteString as BS
import Data.Array.Repa as R
import Data.Array.Repa.Repr.ByteString as R

-- Delayed array with one Float per 3 input bytes; nothing is computed yet.
getFloatsArr :: ByteString -> Array D DIM1 Float
getFloatsArr bs = R.traverse strArr (\(Z :. n) -> Z :. (n `div` 3)) getFloat where
  strArr = R.fromByteString (Z :. BS.length bs) bs
  getFloat getWord8 (Z :. k) =
    toFloat (getWord8 (Z :. k*3)) (getWord8 (Z :. k*3+1)) (getWord8 (Z :. k*3+2))
  toFloat = undefined -- convert to `Float` from 3 `Word8`s

processFurther :: Array U DIM1 Float -> a
processFurther = undefined

main :: IO ()
main = do
  bs <- BS.readFile "file.txt"
  -- Evaluate the delayed array in parallel into an unboxed array.
  arr <- R.computeUnboxedP $ getFloatsArr bs
  processFurther arr
  return ()
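As for fromFunction: it expects a function from an index to an element, so a list such as (map toFloat bsList) would not fit, but indexing straight into the ByteString does. Below is a rough sketch of that variant. The toFloat shown here is only an illustration that assumes signed 24-bit little-endian samples scaled to [-1, 1), since the actual encoding isn't given, and getFloatsArr' is a hypothetical name:

```haskell
import qualified Data.ByteString as BS
import Data.Array.Repa as R
import Data.Bits (shiftL, (.|.))
import Data.Int (Int32)
import Data.Word (Word8)

-- ASSUMPTION: signed 24-bit little-endian samples, scaled to [-1, 1).
-- Replace this with the conversion that matches the real file format.
toFloat :: Word8 -> Word8 -> Word8 -> Float
toFloat b0 b1 b2 = fromIntegral signed / 8388608
  where
    raw :: Int32
    raw = fromIntegral b0
      .|. (fromIntegral b1 `shiftL` 8)
      .|. (fromIntegral b2 `shiftL` 16)
    -- Sign-extend from 24 bits.
    signed = if raw >= 0x800000 then raw - 0x1000000 else raw

-- Delayed array built with fromFunction: element k reads bytes
-- 3k, 3k+1 and 3k+2 directly, so no intermediate list is created.
-- Trailing bytes that do not form a full sample are ignored.
getFloatsArr' :: BS.ByteString -> Array D DIM1 Float
getFloatsArr' bs = R.fromFunction (Z :. (BS.length bs `div` 3)) getFloat
  where
    getFloat (Z :. k) =
      toFloat (BS.index bs (3 * k)) (BS.index bs (3 * k + 1)) (BS.index bs (3 * k + 2))
```

The result can be passed to R.computeUnboxedP in the same way as getFloatsArr above. Whether this or the traverse version performs better is something you would have to measure.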