arrays haskell bytestring repa

Reading samples from a file into an array


I've written a program that analyzes sample data contained in a file. Currently, my program reads the samples into a list, and I perform all further analysis and processing on that list of samples ([Float]).

I'm not quite happy with the performance and I'm thinking about using Arrays instead of lists for better performance. I'm also looking into parallelizing my implementation and Data.Array.Repa looks promising.

Currently, reading from the file goes something like this:

  1. I read all the samples into a ByteString, using hGet.
  2. I know each sample is represented by 3 bytes, so I group the ByteString into a list of ByteStrings of 3.
  3. I map my toFloat function on the list of ByteStrings to get a list of Floats.

This results in the [Float] that I analyze to get the desired information.
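The three steps above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: `chunksOf3`, `samplesFromBytes` and `readSamples` are hypothetical names, `decode` stands in for the question's toFloat, and `BS.readFile` is used in place of hGet for brevity.

```haskell
import qualified Data.ByteString as BS

-- Step 2: split the input into consecutive 3-byte chunks.
-- A trailing chunk shorter than 3 bytes is dropped.
chunksOf3 :: BS.ByteString -> [BS.ByteString]
chunksOf3 bs
  | BS.length bs < 3 = []
  | otherwise        = chunk : chunksOf3 rest
  where
    (chunk, rest) = BS.splitAt 3 bs

-- Step 3: map the per-sample decoder over the chunks.
-- `decode` is a stand-in for the question's toFloat.
samplesFromBytes :: (BS.ByteString -> Float) -> BS.ByteString -> [Float]
samplesFromBytes decode = map decode . chunksOf3

-- Step 1 plus the rest: read the whole file strictly, then decode it.
-- (Hypothetical helper; the question uses hGet on a handle instead.)
readSamples :: (BS.ByteString -> Float) -> FilePath -> IO [Float]
readSamples decode path = samplesFromBytes decode <$> BS.readFile path
```

Every sample passes through its own small ByteString and a list cell here, which is the intermediate structure the rest of the question is trying to avoid.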

I'm wondering at which step in this process I should start using an Array. I first thought about using the listArray function to transform my [Float] into an array of floats. I'm not sure, but this doesn't seem to be the most efficient way.

Would it be possible to use Data.Array.Repa.fromFunction to construct the array right after step 2 and skip the intermediate [Float]? For the function argument, could I use something like (map toFloat bsList), where bsList is the list of ByteStrings after grouping?
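For reference, fromFunction takes a shape and a function from an index to an element, so a list such as (map toFloat bsList) would not fit as its argument directly. A minimal sketch of what an index-based version might look like, skipping the grouping step entirely, is shown here. `floatsFromBytes` is a hypothetical name, and `decode` is assumed to be a conversion from three bytes to a Float, which the question does not specify.

```haskell
import qualified Data.ByteString as BS
import Data.Array.Repa (Array, D, DIM1, Z (..), (:.) (..), fromFunction)
import Data.Word (Word8)

-- Build a delayed array whose element k is decoded from
-- bytes 3k, 3k+1 and 3k+2 of the input.
-- `decode` is an assumed three-byte conversion (hypothetical).
floatsFromBytes
  :: (Word8 -> Word8 -> Word8 -> Float)
  -> BS.ByteString
  -> Array D DIM1 Float
floatsFromBytes decode bs = fromFunction (Z :. n) sample
  where
    -- Number of complete samples; trailing bytes are ignored.
    n = BS.length bs `div` 3
    -- BS.index is an O(1), bounds-checked byte lookup.
    sample (Z :. k) =
      decode (BS.index bs (3 * k))
             (BS.index bs (3 * k + 1))
             (BS.index bs (3 * k + 2))
```

The result is a delayed array, so no Float is computed until the array is forced, for example with computeUnboxedP.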

Or is there a way to read the samples directly into an array?


Solution

  • Repa can use a ByteString directly as the backing representation of an array, so you can start processing the ByteString in parallel right off the bat by trying something along these lines:

    #!/usr/bin/env stack
    -- stack runghc --package repa
    
    import Data.ByteString as BS
    import Data.Array.Repa as R
    import Data.Array.Repa.Repr.ByteString as R
    
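    -- Build a delayed (D) array of Floats on top of the raw bytes.
    getFloatsArr :: ByteString -> Array D DIM1 Float
    getFloatsArr bs = R.traverse strArr (\(Z :. n) -> Z :. (n `div` 3)) getFloat where
      -- Wrap the ByteString as a 1-D Repa array of Word8 (an O(1) operation).
      strArr = R.fromByteString (Z :. BS.length bs) bs
      -- Sample k is decoded from bytes 3k, 3k+1 and 3k+2.
      getFloat getWord8 (Z :. k) =
        toFloat (getWord8 (Z :. k*3)) (getWord8 (Z :. k*3+1)) (getWord8 (Z :. k*3+2))
      toFloat = undefined -- convert to `Float` from 3 `Word8`s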
    
    processFurther :: Array U DIM1 Float -> a
    processFurther = undefined
    
    main :: IO ()
    main = do
      bs <- BS.readFile "file.txt"
      arr <- R.computeUnboxedP $ getFloatsArr bs
      processFurther arr
      return ()
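The toFloat placeholder above is left undefined because the sample encoding is not stated in the question. Purely as an illustration, and assuming each sample is a signed 24-bit little-endian integer scaled to the range [-1, 1) (an assumption, not something the question confirms), the conversion might look like this:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Int (Int32)
import Data.Word (Word8)

-- ASSUMPTION: samples are signed 24-bit little-endian integers.
-- b0 is the least significant byte, b2 the most significant.
toFloat :: Word8 -> Word8 -> Word8 -> Float
toFloat b0 b1 b2 = fromIntegral signed / 8388608 -- divide by 2^23
  where
    -- Assemble the three bytes into an unsigned 24-bit value.
    raw :: Int32
    raw = fromIntegral b0
      .|. (fromIntegral b1 `shiftL` 8)
      .|. (fromIntegral b2 `shiftL` 16)
    -- Sign-extend from 24 bits: values with bit 23 set are negative.
    signed :: Int32
    signed = if raw >= 0x800000 then raw - 0x1000000 else raw
```

With this definition, `toFloat 0 0 0x40` gives 0.5 and `toFloat 0 0 0x80` gives -1.0. If the file uses a different byte order or an unsigned format, only this function would need to change.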