I'd like to understand why this simple parser runs out of memory for large files. I'm really clueless what am I doing wrong.
import Data.Attoparsec.ByteString.Char8
import qualified Data.Attoparsec.ByteString.Lazy as Lazy
import System.Environment
import qualified Data.ByteString.Lazy as B
import Control.Applicative
parseLine :: Parser String
parseLine = manyTill' anyChar (endOfLine <|> endOfInput)
parseAll :: Parser [Int]
parseAll = manyTill'
(parseLine >> (return 0)) -- discarding what's been read
endOfInput
main :: IO()
main = do
[fn] <- getArgs
text <- B.readFile fn
case Lazy.parse parseAll text of
Lazy.Fail _ _ _ -> putStrLn "bad"
Lazy.Done _ _ -> putStrLn "ok"
I'm running the program with:
runhaskell.exe test.hs x.log
Output:
test.hs: Out of memory
x.log is about 500MB in size. My machine has 16GB of RAM.
I'm not that familiar with Attoparsec, but I think you might have a difficult time using it, alone, to parse a huge file in constant memory. If you replace your top-level parser parseAll
with:
parseAll :: Parser ()
parseAll = skipMany anyChar
and profile it, you'll find that memory usage still grows without bound. (And when I converted your code to use incremental reading with strict ByteString
s, it didn't make any difference.)
I believe the problem is this: because Attoparsec does automatic backtracking, it has to be prepared for parseAll
(your version or mine -- it doesn't matter) to be used like this:
(parseAll <* somethingThatDoesntMatch) <|> parseDifferently
If parseAll
has parsed half a million lines and reaches the end, somethingThatDoesntMatch
will cause it to backtrack all the way back to the beginning and then reparse everything with parseDifferently
. So, the meta information for backtracking and the ByteStrings themselves can't be freed until the parse is completely finished.
Now, your parser (and my example above), "obviously" won't need to backtrack this way, but Attoparsec doesn't deduce this.
I can think of a couple of ways to proceed:
try
).