I am stuck at the following parsing problem:
Parse some text string that may contain zero or more elements from a limited character set, up to but not including one of a set of termination characters. Content/no content should be indicated through Maybe
. Termination characters may appear in the string in escaped form. Parsing should fail on any inadmissible character.
This is what I came up with (simplified):
import qualified Text.Megaparsec as MP
-- Predicate for admissible characters, not including the control characters.
isAdmissibleChar :: Char -> Bool
...
-- Predicate for control characters that need to be escaped.
isControlChar :: Char -> Bool
...
-- The escape character.
escChar :: Char
...
pComponent :: Parser (Maybe Text)
pComponent = do
t <- MP.many (escaped <|> regular)
if null t then return Nothing else return $ Just (T.pack t)
where
regular = MP.satisfy isAdmissibleChar <|> fail "Inadmissible character"
escaped = do
_ <- MC.char escChar
MP.satisfy isControlChar -- only control characters may be escaped
Say, admissible characters are uppercase ASCII, escape is '\', and control is ':'.
Then, the following parses correctly: ABC\:D:EF
to yield ABC:D
.
However, parsing ABC&D
, where &
is inadmissible, does yield ABC
whereas I would expect an error message instead.
Two questions:
fail
end parsing instead of failing the parser?many
has to allow its sub-parser to fail once without the whole parse
failing - for example many (char 'A') *> char 'B'
, while parsing
"AAAB", has to fail to parse the B to know it got to the end of the
As.
You might want manyTill
which allows you to recognise the terminator
explicitly. Something like this:
MP.manyTill (escaped <|> regular) (MP.satisfy isControlChar)
"ABC&D" would give an error here assuming '&' isn't accepted by isControlChar.
Or if you want to parse more than one component you might keep your
existing definition of pComponent and use it with sepBy
or similar, like:
MP.sepBy pComponent (MP.satisfy isControlChar)
If you also check for end-of-file after this, like:
MP.sepBy pComponent (MP.satisfy isControlChar) <* MP.eof
then "ABC&D" should give an error again, because the '&' will end the first component but will not be accepted as a separator.