I'm looking at this example from attoparsec docs:
simpleComment = string "<!--" *> manyTill anyChar (string "-->")
This will build a [Char] instead of a ByteString slice. That's not good with huge comments, right?
The other alternative, takeWhile:
takeWhile :: (Word8 -> Bool) -> Parser ByteString
cannot accept a parser (i.e. cannot match a ByteString, only a Word8).
Is there a way to parse a chunk of ByteString with attoparsec that doesn't involve building a [Char] in the process?
You can use scan:
scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString
A stateful scanner. The predicate consumes and transforms a state argument, and each transformed state is passed to successive invocations of the predicate on each byte of the input until one returns Nothing or the input ends.
It would look something like this:
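import Control.Applicative ((<|>))
import Data.Word (Word8)
-- ((current state, next character), new state)
transitions :: [((Int, Char), Int)]
transitions = [((0, '-'), 1), ((1, '-'), 2), ((2, '-'), 2), ((2, '>'), 3)]
dfa :: Int -> Word8 -> Maybe Int
dfa 3 _ = Nothing  -- "-->" has been consumed: tell scan to stop
dfa s w = lookup (s, toEnum (fromEnum w)) transitions <|> Just 0  -- no transition: reset to 0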
And then use scan 0 dfa to take bytes up to and including the terminating "-->". The state I'm using here tells how many characters of "-->" we've seen so far. Once we've seen them all, we inform scan that it's time to stop. This is just to illustrate the idea; for speed you might want to replace the association list with a faster data structure, move the *Enum calls into the lookup table, and even consider writing the function directly.
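As a rough sketch of the "write the function directly" suggestion, the transitions can be pattern-matched on raw byte values (45 is '-', 62 is '>'), which avoids both the association list and the Enum conversions. The names step and commentBody are made up for this illustration. Since scan also stops when the input ends, the sketch assumes you want an unterminated comment to be a parse failure and checks for the "-->" suffix before stripping it.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Attoparsec.ByteString (Parser, parseOnly, scan, string)
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as B
import           Data.Word                  (Word8)

-- State = number of bytes of "-->" matched so far (0..3).
-- (Hypothetical name; a hand-written version of the table-driven dfa.)
step :: Int -> Word8 -> Maybe Int
step 3 _  = Nothing  -- terminator already consumed: stop before this byte
step 2 62 = Just 3   -- '>' after "--" completes the terminator
step 2 45 = Just 2   -- another '-' still leaves "--" as the current suffix
step 1 45 = Just 2   -- second '-'
step _ 45 = Just 1   -- first '-'
step _ _  = Just 0   -- any other byte resets the match

-- Parses "<!-- ... -->" and returns the body as a ByteString,
-- without going through [Char]. (Hypothetical name.)
commentBody :: Parser ByteString
commentBody = do
  _   <- string "<!--"
  raw <- scan 0 step
  -- scan also returns at end of input, so confirm the terminator was seen.
  if "-->" `B.isSuffixOf` raw
    then pure (B.take (B.length raw - 3) raw)
    else fail "unterminated comment"
```

For example, parseOnly commentBody "<!-- hi -->rest" should yield Right " hi ", leaving "rest" unconsumed. If you need the final state instead of a suffix check, attoparsec's runScanner may be worth a look.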