I'm currently encountering a problem while translating a parser from a CFG-based tool (antlr) to Megaparsec.
The grammar contains lists of expressions (handled with makeExprParser
) that are enclosed in brackets (<
, >
) and separated by ,
.
Stuff like <>
, <23>
, <23,87>
etc.
The problem now is that the expressions may themselves contain the >
operator (meaning "greater than"), which causes my parser to fail.
<1223>234>
should, for example, be parsed into [BinaryExpression ">" (IntExpr 1223) (IntExpr 234)]
.
I presume that I have to strategically place try
somewhere, but the places I tried (to the first argument of sepBy
and the first argument of makeExprParser
) did unfortunately not work.
Can I use makeExprParser
in such a situation or do I have to manually write the expression parser?:
This is the relevant part of my parser:
-- uses megaparsec, text, and parser-combinators
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Combinators.Expr
import Data.Text
import Data.Void
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
type BinaryOperator = Text
type Name = Text
data Expr
= IntExpr Integer
| BinaryExpression BinaryOperator Expr Expr
deriving (Eq, Show)
type Parser = Parsec Void Text
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc
symbol :: Text -> Parser Text
symbol = L.symbol sc
sc :: Parser ()
sc = L.space space1 (L.skipLineComment "//") (L.skipBlockCommentNested "/*" "*/")
parseInteger :: Parser Expr
parseInteger = do
number <- some digitChar
_ <- sc
return $ IntExpr $ read number
parseExpr :: Parser Expr
parseExpr = makeExprParser parseInteger [[InfixL (BinaryExpression ">" <$ symbol ">")]]
parseBracketList :: Parser [Expr]
parseBracketList = do
_ <- symbol "<"
exprs <- sepBy parseExpr (symbol ",")
_ <- symbol ">"
return exprs
main :: IO ()
main = do
text : _ <- getArgs
let res = runParser parseBracketList "stdin" (pack text)
case res of
(Right suc) -> do
print suc
(Left err) ->
putStrLn $ errorBundlePretty err
You've (probably) misdiagnosed the problem. Your parser fails on <1233>234>
because it's trying to parse >
as a left associative operator, like +
. In other words, the same way:
1+2+
would fail, because the second +
has no right-hand operand, your parser is failing because:
1233>234>
has no digit following the second >
. Assuming you don't want your >
operator to chain (i.e., 1>2>3
is not a valid Expr
), you should first replace InfixL
with InfixN
(non-associative) in your makeExprParser
table. Then, it will parse this example fine.
Unfortunately, with or without this change your parser will still fail on the simpler test case:
<1233>
because the >
is interpreted as an operator within a continuing expression.
In other words, the problem isn't that your parser can't handle expressions with >
characters, it's that it's overly aggressive in treating >
characters as part of an expression, preventing them from being recognized as the closing angle bracket.
To fix this, you need to figure out exactly what you're parsing. Specifically, you need to resolve the ambiguity in your parser by precisely characterizing the situations where >
can be part of a continuing expression and where it can't.
One rule that will probably work is to only consider a >
as an operator if it is followed by a valid "term" (i.e., a parseInteger
). You can do this with lookAhead
. The parser:
symbol ">" <* lookAhead term
will parse a >
operator only if it is followed by a valid term
. If it fails to find a term, it will consume some input (at least the >
symbol itself), so you must surround it with a try
:
try (symbol ">" <* lookAhead term)
With the above two fixes applied to parseExpr
:
parseExpr :: Parser Expr
parseExpr = makeExprParser term
[[InfixN (BinaryExpression ">" <$ try (symbol ">" <* lookAhead term))]]
where term = parseInteger
you'll get the following parses:
λ> parseTest parseBracketList "<23>"
[IntExpr 23]
λ> parseTest parseBracketList "<23,87>"
[IntExpr 23,IntExpr 87]
λ> parseTest parseBracketList "<23,87>18>"
[IntExpr 23,BinaryExpression ">" (IntExpr 87) (IntExpr 18)]
However, the following will fail:
λ> parseTest parseBracketList "<23,87>18"
1:10:
|
1 | <23,87>18
| ^
unexpected end of input
expecting ',', '>', or digit
λ>
because the fact that the >
is followed by 18
means that it is a valid operator, and it is parse failure that the valid expression 87>18
is followed by neither a comma nor a closing >
angle bracket.
If you need to parse something like <23,87>18
, you have bigger problems. Consider the following two test cases:
<1,2>3,4,5,6,7,...,100000000000,100000000001>
<1,2>3,4,5,6,7,...,100000000000,100000000001
It's a challenge to write an efficient parser that will parse the first one as a list of 10000000000 expressions but the second one as a list of two expression:
[IntExpr 1, IntExpr 2]
followed by some "extra" text. Hopefully, the underlying "language" you're trying to parse isn't so hopelessly broken that this will be an issue.