I have seen people recommend the `pipes`/`conduit` libraries for various lazy-IO-related tasks. What problem do these libraries solve exactly?

Also, when I try to use some library from Hackage, it is highly likely that there are three different versions, for example `attoparsec`, `pipes-attoparsec`, and `attoparsec-conduit`.

This confuses me. For my parsing tasks, should I use `attoparsec` or `pipes-attoparsec`/`attoparsec-conduit`? What benefit does the `pipes`/`conduit` version give me as compared to plain vanilla `attoparsec`?
Lazy IO works like this:

```haskell
readFile :: FilePath -> IO ByteString
```

where the `ByteString` is guaranteed to be read only chunk-by-chunk. To do so we could (almost) write
```haskell
-- given `readChunk`, which reads the chunk beginning at offset n
readChunk :: FilePath -> Int -> IO (Int, ByteString)

readFile :: FilePath -> IO ByteString
readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    chunks      <- readChunks n'
    return (chunk <> chunks)
```
but here we note that the `IO` action `readChunks n'` is performed prior to returning even the partial result available as `chunk`. This means we're not lazy at all. To combat this we use `unsafeInterleaveIO`:
```haskell
readFile fp = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    chunks      <- unsafeInterleaveIO (readChunks n')
    return (chunk <> chunks)
```
which causes `readChunks n'` to return immediately, thunking an `IO` action to be performed only when that thunk is forced.

That's the dangerous part: by using `unsafeInterleaveIO` we've delayed a bunch of `IO` actions to non-deterministic points in the future that depend upon how we consume our chunks of `ByteString`.
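As a minimal, self-contained illustration of that non-determinism (using an `IORef` as a stand-in for a file handle), the delayed read below runs only when its result is forced, which here happens *after* a later write:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.IORef

main :: IO ()
main = do
  ref     <- newIORef (0 :: Int)
  -- the read is thunked here, not performed
  lazyVal <- unsafeInterleaveIO (readIORef ref)
  writeIORef ref 42
  -- forcing the thunk performs the read *now*, after the write
  print lazyVal
```

This prints `42` rather than `0`: the effect's observable result depended on when the pure-looking value was demanded. With a real file handle, the analogous surprise is reading from a handle that has since been closed.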
What we'd like to do is slide a chunk-processing step in between the call to `readChunk` and the recursion on `readChunks`:
```haskell
readFileCo :: Monoid a => FilePath -> (ByteString -> IO a) -> IO a
readFileCo fp action = readChunks 0 where
  readChunks n = do
    (n', chunk) <- readChunk fp n
    a  <- action chunk
    as <- readChunks n'
    return (a <> as)
```
Now we've got the chance to perform arbitrary `IO` actions after each small chunk is loaded. This lets us do much more work incrementally without completely loading the `ByteString` into memory. Unfortunately, it isn't terrifically compositional: we need to build our consumption `action` and pass it to our `ByteString` producer in order for it to run.
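To make that callback pattern concrete, here is a self-contained sketch in the same shape as `readFileCo`, with an in-memory `readChunk` standing in for the hypothetical file reader (a real version would do actual file IO; this one also adds the end-of-input check the sketches above elide):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Monoid (Sum(..))

-- a mock chunk source standing in for the hypothetical `readChunk`:
-- returns the chunk at index n and the next index, empty past the end
readChunk :: Int -> IO (Int, ByteString)
readChunk n = return (n + 1, if n < 3 then "chunk" else "")

-- `readFileCo` specialized to the mock source, stopping on an empty chunk
readCo :: Monoid a => (ByteString -> IO a) -> IO a
readCo action = go 0 where
  go n = do
    (n', chunk) <- readChunk n
    if BS.null chunk
      then return mempty
      else do
        a  <- action chunk
        as <- go n'
        return (a <> as)

main :: IO ()
main = do
  -- count bytes chunk-by-chunk without assembling the whole ByteString
  total <- readCo (return . Sum . BS.length)
  print (getSum total)  -- 15: three 5-byte chunks
```

The consumer (`return . Sum . BS.length`) and the producer are welded together by hand; swapping in a different consumer means threading a new function through, which is the compositionality problem the text describes.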
This is essentially what `pipes` solves: it allows us to compose effectful coroutines with ease. For instance, we can now write our file reader as a `Producer`, which can be thought of as "streaming" the chunks of the file when its effect finally gets run:
```haskell
produceFile :: FilePath -> Producer ByteString IO ()
produceFile fp = produce 0 where
  produce n = do
    (n', chunk) <- liftIO (readChunk fp n)
    yield chunk
    produce n'
```
Note the similarities between this code and `readFileCo` above: we simply replace the call to the coroutine `action` with `yield`ing the `chunk` we've produced so far. This call to `yield` builds a `Producer` type instead of a raw `IO` action, which we can compose with other `Pipe` types in order to build a nice consumption pipeline called an `Effect IO ()`.
All of this pipe building gets done statically, without actually invoking any of the `IO` actions. This is how `pipes` lets you write your coroutines more easily. All of the effects get triggered at once when we call `runEffect` in our `main` `IO` action.
```haskell
runEffect :: Effect IO () -> IO ()
```
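To see that composition in action, here is a tiny sketch (assuming the `pipes` package) with a toy `Producer` in place of `produceFile`; `Pipes.Prelude` supplies stock `Pipe`s and `Consumer`s:

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- a toy producer standing in for produceFile
numbers :: Producer Int IO ()
numbers = mapM_ yield [1, 2, 3]

-- `>->` merely wires the coroutines together; nothing runs until runEffect
main :: IO ()
main = runEffect $ numbers >-> P.map (* 10) >-> P.print
```

Running this prints `10`, `20`, `30`. The producer, transformer, and consumer are each written independently and snapped together with `>->`, which is exactly the compositionality that the hand-threaded `action` in `readFileCo` lacked.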
So why would you want to plug `attoparsec` into `pipes`? Well, `attoparsec` is optimized for lazy parsing. If you are producing the chunks fed to an `attoparsec` parser in an effectful way, then you'll be at an impasse. You could use `pipes` (or `conduit`) to build up a system of coroutines which includes your lazy `attoparsec` parser, allowing it to operate on as little input as it needs while producing parsed values as lazily as possible across the entire stream.
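To get a feel for why `attoparsec` suits this setup: its parsers run incrementally, returning `Partial` continuations that you `feed` chunk by chunk, which is the interface a streaming wrapper like `pipes-attoparsec` or `attoparsec-conduit` drives for you. A small sketch of the raw incremental interface:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString (parse, feed, IResult(..))
import Data.Attoparsec.ByteString.Char8 (decimal)

main :: IO ()
main = do
  let step1 = parse decimal "12"  -- Partial: the number might continue
      step2 = feed step1 "34"     -- still Partial, accumulates digits
      step3 = feed step2 ""       -- empty chunk signals end of input
  case step3 of
    Done _ n -> print (n :: Int)  -- 1234
    _        -> putStrLn "parse failed"
```

The parser consumes each chunk as it arrives and suspends itself in between, so a coroutine pipeline can hand it effectfully produced chunks one at a time without ever materializing the whole input.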