I am writing a parser for a markdown-like document format. I want to be able to match something like ^[some *formatted* text]
as a footnote in my syntax definition. Here's a minimal example:
{- cabal:
build-depends: base, text, megaparsec, parser-combinators, hspec, hspec-megaparsec
-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Void (Void)
import Test.Hspec
import Test.Hspec.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as L
type Parser = Parsec Void Text
data Words
= PlainText Text
| BoldText Text
| MagicText [Words]
deriving (Show, Eq)
text_ :: Parser Words
text_ =
choice
[
MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']')),
BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')
]
main :: IO ()
main = hspec $ do
context "for basic one-word-at-a-time input" $ do
it "parses plain text" $ parse text_ "" "hello" `shouldParse` PlainText "hello"
it "parses bold text" $ parse text_ "" "*hello*" `shouldParse` BoldText "hello"
context "parses nested \"MagicText\"" $ do
it "on it's own with just one word inside" $
parse text_ "" "^[hello]" `shouldParse` MagicText [PlainText "hello"]
it "on it's own with bold text inside" $
parse text_ "" "^[*hello*]" `shouldParse` MagicText [BoldText "hello"]
The last two test cases fail with the following errors:
~/sandbox > cabal run ParseBetween.hs
for basic one-word-at-a-time input
parses plain text [✔]
parses bold text [✔]
parses nested "MagicText"
on it's own with just one word inside [✘]
on it's own with bold text inside [✘]
Failures:
/home/gideon/sandbox/ParseBetween.hs:43:33:
1) parses nested "MagicText" on it's own with just one word inside
expected: MagicText [PlainText "hello"]
but parsing failed with error:
1:9:
|
1 | ^[hello]
| ^
unexpected end of input
expecting "^[", '*', ']', plain text, or white space
To rerun use: --match "/parses nested \"MagicText\"/on it's own with just one word inside/" --seed 100639639
/home/gideon/sandbox/ParseBetween.hs:46:35:
2) parses nested "MagicText" on it's own with bold text inside
expected: MagicText [BoldText "hello"]
but parsing failed with error:
1:11:
|
1 | ^[*hello*]
| ^
unexpected end of input
expecting ']'
To rerun use: --match "/parses nested \"MagicText\"/on it's own with bold text inside/" --seed 100639639
From the definition of manyTill_ I would expect it to match the ending ]
first, and therefore not run into this "unexpected end-of-input" error, but I can't work out how to have this nested parsing behaviour in a way which works.
I can't see by inspection what's wrong with your bold-text example. But the problem with "[hello]"
is simple enough. You start parsing MagicText
, which consumes the [
and delegates to text_
again, planning to consume a ]
afterwards. But the parser inside PlainText
doesn't know it's supposed to leave behind a ]
character. It happily consumes all the way to the end of the string, because it never encounters one of its stop characters, ' '
or '\n'
. Then it completes, and the MagicText
above it is upset it can't find its closing ]
.
A common way to handle problems like this is to have a grammar with more explicit separations of its concepts, encoded in a hierarchy. A MagicText
doesn't contain "any text, including magic, bold, or plain text": it includes "bold text or plain text". A BoldText
doesn't contain "any text, including magic, bold, or plain text": it contains only plain text. And PlainText
explicitly rejects characters that would be treated as delimiters/metacharacters for the levels above it. Roughly like this:
text_ :: Parser Words
text_ =
choice
[
MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space),
nonMagicText
]
nonMagicText =
choice
[
BoldText <$> between (char '*') (char '*') plainText,
PlainText <$> plainText
]
plainText =
takeWhile1P (Just "plaintext") (`notElem` "*^[] \n")