haskell attoparsec

Slices with attoparsec


I'm looking at this example from the attoparsec docs:

simpleComment   = string "<!--" *> manyTill anyChar (string "-->")

This builds a [Char] one character at a time instead of returning a ByteString slice. That's not good with huge comments, right?

The other alternative, takeWhile:

takeWhile :: (Word8 -> Bool) -> Parser ByteString

takes only a per-byte predicate, not a parser, so it can stop on a single Word8 but not on a multi-byte ByteString such as "-->".

Is there a way to parse a chunk of a ByteString with attoparsec without building a [Char] in the process?


Solution

  • You can use scan:

    scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString
    

    A stateful scanner. The predicate consumes and transforms a state argument, and each transformed state is passed to successive invocations of the predicate on each byte of the input until one returns Nothing or the input ends.

    It would look something like this:

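    import Control.Applicative ((<|>))
    import Data.Word (Word8)
    
    -- (state, next character) -> next state; the state counts matched characters of "-->"
    transitions :: [((Int, Char), Int)]
    transitions = [((0, '-'), 1), ((1, '-'), 2), ((2, '-'), 2), ((2, '>'), 3)]
    
    dfa :: Int -> Word8 -> Maybe Int
    dfa 3 _ = Nothing  -- all of "-->" seen: stop scanning
    dfa s w = lookup (s, toEnum (fromEnum w)) transitions <|> Just 0  -- any other byte resets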
    

    Then use scan 0 dfa to take bytes up to and including the terminating "-->". The state records how many characters of "-->" have been seen so far; once all three have been seen, dfa returns Nothing to tell scan that it's time to stop. This is just to illustrate the idea: for efficiency you might want to replace the association list with a faster data structure, move the *Enum calls into the lookup table, or even write the transition function directly.
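As a rough sketch of the "write the function directly" variant, the transition table can be replaced by pattern matching on the raw bytes (45 is ASCII '-' and 62 is ASCII '>'). The names step, comment and parseComment are made up for this illustration and are not part of attoparsec. Since scan itself does not fail when the input runs out, this sketch assumes you want an unterminated comment to be a parse error and checks for the "-->" suffix explicitly before stripping it.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Attoparsec.ByteString (Parser, parseOnly, scan, string)
import           Data.ByteString            (ByteString)
import qualified Data.ByteString            as B
import           Data.Word                  (Word8)

-- Hypothetical helper: the state is the number of bytes of "-->"
-- matched so far (0 to 3).
step :: Int -> Word8 -> Maybe Int
step 3 _  = Nothing   -- terminator complete: stop before consuming this byte
step 2 62 = Just 3    -- '>' after "--"
step 2 45 = Just 2    -- another '-' still leaves "--" pending
step 1 45 = Just 2    -- second '-'
step _ 45 = Just 1    -- first '-'
step _ _  = Just 0    -- any other byte resets the match

-- Hypothetical parser: consume "<!--", then return the comment body
-- without the trailing "-->".
comment :: Parser ByteString
comment = do
  _    <- string "<!--"
  body <- scan 0 step
  -- scan also returns when the input ends, so check that the
  -- terminator was actually seen.
  if "-->" `B.isSuffixOf` body
    then pure (B.take (B.length body - 3) body)
    else fail "unterminated comment"

-- Convenience wrapper for trying the parser on a complete input.
parseComment :: ByteString -> Either String ByteString
parseComment = parseOnly comment
```

With these definitions, parseComment "<!-- hi -->rest" evaluates to Right " hi ". The body comes back as a ByteString, and B.take only narrows that value, so no [Char] is built along the way.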