Isn't Parsec
's try
supposed to backtrack when it encounters failure?
For instance, if I have the code
import Control.Applicative ((<|>))
import Debug.Trace
import Text.Parsec (try)
import Text.Parsec.Combinator (eof)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Func = Add | Sub deriving (Show)
add :: Parser Func
add = do
Parsec.string "+"
Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
return Add
sub :: Parser Func
sub = do
Parsec.string "-"
Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
return Sub
func :: Parser Func
func = add <|> sub
data AST =
Primitive String
| BinOp Func AST AST
deriving (Show)
primitive :: Parser AST
primitive = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
trace "primitive returning" $ return $ Primitive str
binOp :: Parser AST
binOp = do
lhs <- parser
Parsec.spaces
operation <- func
Parsec.spaces
rhs <- parser
return $ BinOp operation lhs rhs
parser :: Parser AST
parser = try primitive <|> binOp
sample :: String
sample = "+#-! + -$!&+"
main :: IO ()
main = print $ Parsec.parse (parser <* eof) "failure" sample
I get the output:
primitive returning
Left "failure" (line 1, column 5):
unexpected ' '
expecting end of input
However, from my understanding, when the parser encounters try primitive <|> binOp
and primitive
fails, it should go to the binOp
option.
Desired parse is
BinOp Add (Primitive "+#-!") (Primitive "-$!&+")
The documentation for try
doesn't actually use the word "backtrack". What it says is:
The parser
try p
behaves like parserp
, except that it pretends that it hasn't consumed any input when an error occurs.
This may make little sense until you match it up with the documentation for the (<|>)
operator which states:
The parser
p <|> q
first appliesp
. If it succeeds, the value ofp
is returned. Ifp
fails without consuming any input, parserq
is tried.
Note the emphasized phrase -- that's emphasis in the original documentation, not emphasis I've added.
Here's how this leads to "backtracking". Consider a simplified version of your parser that allows no spaces, only single-letter primitives, and addition of two primitives as the only operation:
...
import Text.Parsec as Parsec
import Text.Parser.Char as Parsec
...
add :: Parser Func
add = do
Parsec.string "+"
return Add
primitive :: Parser AST
primitive = do
c <- Parsec.letter
return $ Primitive (c:"")
binOp :: Parser AST
binOp = do
lhs <- primitive
operation <- add
rhs <- primitive
return $ BinOp operation lhs rhs
parser :: Parser AST
parser = binOp <|> primitive
This parser doesn't work. It parses "x+y"
okay:
λ> parseTest parser "x+y"
BinOp Add (Primitive "x") (Primitive "y")
but it fails to parse "x"
alone:
λ> parseTest parser "x"
parse error at (line 1, column 2):
unexpected end of input
expecting "+"
The problem in this second case is that the <|>
operator tries binOp
first. The binOp
parser consumes a primitive, namely "x"
, and then fails to parse add
(because there's no "+"
character). As a result, binOp
fails after consuming input. The <|>
doesn't try its right-hand side alternative parser (primitive
) because p <|> q
only tries q
if p
fails without consuming input, and that's not what happened here. In other words, by default, after the left-hand side of the alternative consumes some input and then fails, there is no "backtracking" to put that consumed input back.
The try
combinator changes this behavior by "undoing" the consumption of input in case of failure. If we fix the parser:
parser :: Parser AST
parser = try binOp <|> primitive
then it works fine on the input "x"
:
λ> parseTest parser "x"
Primitive "x"
As before, the <|>
operator tries the try binOp
parser first. The try binOp
parser acts just like binOp
-- it consumes a primitive and then fails to parse add
, resulting in a "failure after consuming input". But try
"backtracks" by replacing this with a "failure without consuming any input" (essentially, putting the input back). This satisfies the requirement for <|>
to try the alternative parser primitive
which, of course, successfully parses "x"
, and that's the final result of the parse.
How is this different from your situation? Well, you've got the equivalent of:
parser :: Parser AST
parser = try primitive <|> binOp
This parser is broken, too. It works on "x"
:
λ> parseTest parser "x"
Primitive "x"
and it "works" on "x+y"
but returns the wrong answer:
λ> parseTest parser "x+y"
Primitive "x"
The problem is that the <|>
first tries the try primitive
parser, which acts just like primitive
. Even in the second test case, it succeeds after consuming the primitive "x"
(leaving "+y"
yet to be parsed). Note that try
has no effect, because the only thing try
does is take "failure after consuming input" and replace it with "failure without consuming any input", but primitive
doesn't fail! It succeeds! So, there's no backtracking because there's no failure to backtrack from. The result of parser
is Primitive "x"
, and now there's some unparseable garbage "+y"
left that will trigger a failure in some other parser that follows parser
, like eof
.
So, to start to fix your parser, you could replace the definition of parser
with:
parser :: Parser AST
parser = try binOp <|> primitive
Here, the try binOp
will start to parse a binOp
, and if it fails part way through (e.g., after parsing a primitive but not finding an operator), then try
will trigger backtracking so the primitive
parser can be tried instead.
Unfortunately, this will cause an infinite loop in your parser because parser
starts by trying binOp
which starts by invoking parser
(for the left-hand side) which starts by trying binOp
, etc., etc., and neither parser makes any progress.
There are a few solutions to this problem. There is machinery in Text.Parsec.Expr
for handling expressions in general, and there are functions like sepBy
or chainl1
for correctly parsing a sequence of terms separated by operators (without creating an infinite loop).
If you want to do it manually, you'll end up having to do something like:
parser :: Parser AST
parser = do
-- there's always a primitive
lhs <- primitive
Parsec.spaces
-- it may be followed by some binOp/primitives
followedByBinOp lhs
followedByBinOp :: AST -> Parser AST
followedByBinOp lhs =
-- if there's another binop/primitive
(do operator <- func
Parsec.spaces
rhs <- primitive
Parsec.spaces
-- then loop with a new LHS
followedByBinOp (BinOp operator lhs rhs)
-- otherwise, return what we've got
<|> return lhs)
This is super ugly, which is why chainl1
was invented. The above can be replaced with:
parser :: Parser AST
parser = Parsec.chainl1 (primitive <* Parsec.spaces) (BinOp <$> func <* Parsec.spaces)
With either of these versions of parser
, your parser still won't work because of two more problems. First, your primitive
doesn't know when to stop:
primitive :: Parser AST
primitive = do
str <- Parsec.many1 $ Parsec.satisfy $ not . Char.isSpace
trace "primitive returning" $ return $ Primitive str
Here, the only way to end a primitive is with a space. That means that the expression "x+y+z"
is a valid single primitive. If you try to parse this with parser
, the initial lhs <- primitive
will eat the whole string, and you'll get back Primitive "x+y+z"
. If you don't mind spaces being mandatory, then you'll be able to get it to parse "x + y + z"
, but if you'd like spaces to be optional, you'll need to somehow restrict the allowed primitives. (There's a reason most languages doesn't allow completely arbitrary identifiers. If x+y
or -- worse -- +#-!
-- is a valid identifier, it's hard to parse expressions with infix operators.)
If you allow only alphanumeric strings as primitives, that'll get you a little closer:
primitive :: Parser AST
primitive = do
str <- Parsec.many1 $ Parsec.satisfy $ Char.isAlphaNum
return $ Primitive str
However, your parser still won't work because add
and sub
are misusing notFollowedBy
. Consider add
:
add :: Parser Func
add = do
Parsec.string "+"
Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
return Add
This works by first parsing a "+"
character. Then, it tries to parse another character. If that character satisfies the predicate (which it does because it's always True
), then the whole parse will fail after consuming input. That's because notFollowedBy
is designed to fail if its parser succeeds. So, except in the case where a "+"
occurs at the end of the string, this add
parser always fails after consuming input, aborting the entire parse.
For now, at least, it's probably best to just delete those notFollowedBy
lines.
Anyway, here's a modified version of your parser that more or less works, but only accepts alphanumeric primitives. It actually doesn't need to backtrack, so it doesn't use try
anywhere:
import Control.Applicative ((<|>))
import Text.Parsec.Combinator (eof)
import Text.Parsec.String (Parser)
import qualified Data.Char as Char
import qualified Text.Parsec as Parsec
data Func = Add | Sub deriving (Show)
add :: Parser Func
add = do
Parsec.string "+"
-- Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
return Add
sub :: Parser Func
sub = do
Parsec.string "-"
-- Parsec.notFollowedBy $ Parsec.satisfy (\_-> True)
return Sub
func :: Parser Func
func = add <|> sub
data AST =
Primitive String
| BinOp Func AST AST
deriving (Show)
primitive :: Parser AST
primitive = do
str <- Parsec.many1 $ Parsec.satisfy $ Char.isAlphaNum
return $ Primitive str
parser :: Parser AST
parser = do
-- there's always a primitive
lhs <- primitive
Parsec.spaces
-- it may be followed by some binOp/primitives
followedByBinOp lhs
followedByBinOp :: AST -> Parser AST
followedByBinOp lhs =
-- if there's another binop/primitive
(do operator <- func
Parsec.spaces
rhs <- primitive
Parsec.spaces
-- then loop with a new LHS
followedByBinOp (BinOp operator lhs rhs)
-- otherwise, return what we've got
<|> return lhs)
sample :: String
sample = "xyz + 15 - 123abc"
main :: IO ()
main = print $ Parsec.parse (parser <* eof) "failure" sample