haskell lazy-evaluation conduit

How to take a lazy ByteString and write it to a file (in constant memory) using conduit


I am streaming the download of an S3 file using amazonka, and I use the sinkBody function to consume the response body. Currently, I download the file as follows:

getFile bucketName fileName = do
    resp <- send (getObject (BucketName bucketName) fileName)
    sinkBody (resp ^. gorsBody) sinkLazy

where sinkBody :: MonadIO m => RsBody -> ConduitM ByteString Void (ResourceT IO) a -> m a. In order to run in constant memory, I thought that sinkLazy would be a good option for getting a value out of the conduit stream.

After this, I would like to save the lazy bytestring of data (S3 file) into a local file, for which I use this code:

-- fetch stream of data from S3
bytestream <- liftIO $ AWS.runResourceT $ runAwsT awsEnv $ getFile serviceBucket key

-- create a file
liftIO $ writeFile filePath  ""

-- write content of stream into the file (strict version), keeps data in memory...
liftIO $ runConduitRes $ yield bytestream .| mapC B.toStrict .| sinkFile filePath

But this code has the flaw that I need to "realise" the whole lazy bytestring in memory, which means that it cannot run in constant space.

EDIT

I also tested writing the lazy bytestring directly to a file, as follows, but this uses about twice the file size in memory. (The writeFile is from Data.ByteString.Lazy.)

bytestream <- liftIO $ AWS.runResourceT $ runAwsT awsEnv $ getFile serviceBucket key
writeFile filename bytestream

Solution

  • Well, the purpose of a streaming library like conduit is to realize some of the benefits of lazy data structures and actions (lazy ByteStrings, lazy I/O, etc.) while better controlling memory usage. The purpose of the sinkLazy function is to take data out of the conduit ecosystem, with its well-controlled memory footprint, and back into the wild West of lazy objects with their associated space leaks. In particular, sinkLazy accumulates every incoming chunk before it returns, so the entire body is resident in memory by the time you get the lazy ByteString back. So, that's your problem right there.

    Rather than sink the stream out of conduit and into a lazy ByteString, you probably want to keep the data in conduit and sink the stream directly into the file, using something like sinkFile. I don't have an AWS test program up and running, but the following type checks and probably does what you want:

    import Conduit
    import Control.Lens
    import Network.AWS
    import Network.AWS.S3
    
    getFile bucketName fileName outputFileName = do
        resp <- send (getObject (BucketName bucketName) fileName)
        sinkBody (resp ^. gorsBody) (sinkFile outputFileName)
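
The same pattern can be tried without an AWS account. The following is only a minimal sketch, assuming the conduit and bytestring packages: fakeBody is a hypothetical stand-in for the S3 response body, and streamToFile, fileSize and the file name out.txt are likewise names made up for illustration. It pipes a stream of strict chunks straight into sinkFile, then reads the file back chunk by chunk to total its size, so neither direction should need the whole content in memory at once.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import qualified Data.ByteString as B

-- Hypothetical stand-in for the S3 response body: emits n small strict chunks.
fakeBody :: Monad m => Int -> ConduitT i B.ByteString m ()
fakeBody n = replicateC n "line of data\n"

-- Pipe the chunks straight into the file; each chunk is written as it arrives
-- instead of being collected into one big value first.
streamToFile :: Int -> FilePath -> IO ()
streamToFile n path = runConduitRes $ fakeBody n .| sinkFile path

-- Read the file back chunk by chunk and add up the chunk lengths with a
-- strict fold, so the file is never loaded as a whole.
fileSize :: FilePath -> IO Int
fileSize path =
  runConduitRes $ sourceFile path .| foldlC (\acc bs -> acc + B.length bs) 0

main :: IO ()
main = do
  streamToFile 100000 "out.txt"
  size <- fileSize "out.txt"
  print size
```

Swapping sinkFile for sinkLazy in streamToFile would be the equivalent of the original getFile: every chunk would be retained until the stream ends.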