parsing haskell parsec megaparsec

How to fail a nested megaparsec parser?


I am stuck at the following parsing problem:

Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.

This is what I came up with (simplified):

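import           Control.Applicative  ((<|>))
import           Data.Text            (Text)
import qualified Data.Text            as T
import           Data.Void            (Void)
import qualified Text.Megaparsec      as MP
import qualified Text.Megaparsec.Char as MC

-- Assumed here: the input stream is Text and there is no custom error type.
type Parser = MP.Parsec Void Text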

-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...

-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...

-- The escape character.
escChar :: Char
...


pComponent :: Parser (Maybe Text)
pComponent = do
  t <- MP.many (escaped <|> regular)
  if null t then return Nothing else return $ Just (T.pack t)
 where
  regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
  escaped = do
    _ <- MC.char escChar
    MP.satisfy isControlChar -- only control characters may be escaped

Say the admissible characters are uppercase ASCII letters, the escape character is '\', and the control character is ':'. Then ABC\:D:EF parses correctly and yields ABC:D. However, parsing ABC&D, where & is inadmissible, yields ABC, whereas I would expect an error message instead.

Two questions:


Solution

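  • many has to let its sub-parser fail once, without consuming input, without the whole parse failing. For example, many (char 'A') *> char 'B', while parsing "AAAB", has to fail to parse an 'A' at the B to know it has reached the end of the As. In your parser, fail "Inadmissible character" consumes no input, so many treats it as the normal end of the repetition rather than as an error.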

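    You might want manyTill, which lets you recognise the terminator explicitly. Note that manyTill consumes the terminator; wrap the terminator parser in MP.lookAhead if it should stay in the input. Something like this: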

    MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
    

    "ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.

    Or if you want to parse more than one component you might keep your existing definition of pComponent and use it with sepBy or similar, like:

    MP.sepBy pComponent (MP.satisfy isControlChar)
    

    If you also check for end-of-file after this, like:

    MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
    

    then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.
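To make the difference concrete, here is a minimal sketch that puts the three variants side by side. It assumes the concrete choices from the question's example (uppercase ASCII is admissible, ':' is the control character, '\' is the escape character) and a Text input stream with no custom error type. The names pChar, toMaybe, pTerminated, pComponents and runP are hypothetical helpers introduced only for this illustration.

```haskell
import           Control.Applicative  ((<|>))
import           Data.Char            (isAsciiUpper)
import           Data.Text            (Text)
import qualified Data.Text            as T
import           Data.Void            (Void)
import qualified Text.Megaparsec      as MP
import qualified Text.Megaparsec.Char as MC

-- Assumption: Text input, no custom error component.
type Parser = MP.Parsec Void Text

-- Assumed concrete definitions, taken from the example in the question.
isAdmissibleChar :: Char -> Bool
isAdmissibleChar = isAsciiUpper

isControlChar :: Char -> Bool
isControlChar = (== ':')

escChar :: Char
escChar = '\\'

-- One character of a component: an escaped control character or a regular one.
pChar :: Parser Char
pChar = escaped <|> regular
  where
    regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
    escaped = MC.char escChar *> MP.satisfy isControlChar

-- Empty content becomes Nothing, anything else is packed into Just.
toMaybe :: String -> Maybe Text
toMaybe s = if null s then Nothing else Just (T.pack s)

-- The definition from the question: many stops quietly at the first
-- character that pChar rejects without consuming input.
pComponent :: Parser (Maybe Text)
pComponent = toMaybe <$> MP.many pChar

-- Variant 1: require an explicit terminator (which manyTill consumes).
pTerminated :: Parser (Maybe Text)
pTerminated = toMaybe <$> MP.manyTill pChar (MP.satisfy isControlChar)

-- Variant 2: several components separated by control characters,
-- followed by end of input.
pComponents :: Parser [Maybe Text]
pComponents = MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof

-- Run a parser without demanding that all input is consumed;
-- the error message is discarded to keep the sketch short.
runP :: Parser a -> String -> Maybe a
runP p = either (const Nothing) Just . MP.parse p "" . T.pack
```

With these definitions, runP pComponent "ABC&D" should succeed with Just "ABC" (the behaviour described in the question), while runP pTerminated "ABC&D" and runP pComponents "ABC&D" should both fail. runP pTerminated "ABC\\:D:EF" should still yield Just "ABC:D".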