Tags: haskell, haskell-pipes

Read large lines in a huge file without buffering


I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine, and that blows through my memory. I later read that it eventually loads the whole file.
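
For reference, this is roughly the shape of my first attempt (a minimal sketch; lineLoop is my own name, and handle is a Handle opened on the file):

    import qualified Data.Text.Lazy.IO as TLIO
    import System.IO (Handle, hIsEOF)

    -- Count lines by repeatedly calling hGetLine; the problem is that each
    -- call still materializes one whole (gigabyte-sized) line in memory.
    lineLoop :: Handle -> Int -> IO Int
    lineLoop handle n = do
      eof <- hIsEOF handle
      if eof
        then return n
        else TLIO.hGetLine handle >> (lineLoop handle $! n + 1)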

I also tried using pipes-text, with folds and view Text.lines:

-- Imports, for context: folds is from Pipes.Group, view from Lens.Family,
-- Pipes is qualified Pipes.Prelude, and Text covers Pipes.Text / Pipes.Text.IO.
s <- Pipes.sum $
    folds (\i _ -> i + 1) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines, and it seems to be doing some wonky stuff: it fails with "hGetChunk: invalid argument (invalid byte sequence)", and it takes 11 minutes where wc -l takes 1 minute. I've heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching, apart from newbie readLine how-tos.

Thanks!


Solution

  • The following code uses Conduit. It UTF-8-decodes standard input, runs
    lineC on each line for as long as more input is available, yields a 1 per
    line while discarding the line's content (so an entire line is never held
    in memory at once), and finally sums the 1s and prints the count.

    You can replace the yield 1 code with something that does processing on
    the individual lines; see the sketch after the program below.

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit
    
    main :: IO ()
    main = (runConduit
         $ stdinC                                  -- stream raw bytes from stdin
        .| decodeUtf8C                             -- decode the byte chunks as UTF-8 Text
        .| peekForeverE (lineC (yield (1 :: Int))) -- per line: yield 1, discard the content
        .| sumC) >>= print                         -- sum the 1s and print the line count
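
    For example (a sketch of my own, not part of the original program), the
    yield 1 can be replaced with lengthCE from conduit-combinators, which
    folds each line's chunks into a character count in constant memory;
    maximumC then reports the longest line's length (as a Maybe, hence the
    Just in the output):

    #!/usr/bin/env stack
    -- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
    import Conduit

    -- Per line: count characters chunk by chunk, then yield the count.
    -- The line itself is never accumulated, so 1 GB lines stay cheap.
    main :: IO ()
    main = (runConduit
         $ stdinC
        .| decodeUtf8C
        .| peekForeverE (lineC (lengthCE >>= \len -> yield (len :: Int)))
        .| maximumC) >>= print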