parsingrustpest

Pest doesn't parse a recursive grammar as I would expect


I'm using the pest crate to implement a recursive grammar in Rust:

id = _{ ASCII_ALPHA_LOWER ~ (ASCII_ALPHANUMERIC|"_")* }
integer = _{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)|"0" }
real = _{ ((integer ~ "." ~ ASCII_DIGIT*) | (integer? ~ "." ~ ASCII_DIGIT+)) ~ (("e"|"E") ~ ("-"|"+")? ~ ASCII_DIGIT+)? }

unaryop = _{ "sin"|"cos"|"tan"|"exp"|"ln"|"sqrt" }

inner_exp = _{ real|integer|"pi"|id }

exp = { SOI ~ ( inner_exp | (exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ inner_exp) | ("-" ~ exp) | ("(" ~ exp ~ ")") | (unaryop ~ "(" ~ exp ~ ")") ) ~ EOI }

However, I'm finding that pest isn't parsing the grammar as I would expect. For instance, 2+3 gives me an error of:

 --> 1:2
  |
1 | 2+3
  |  ^---
  |
  = expected EOI

It appears that the inner_exp choice is being parsed and then, when the + symbol is encountered, the parser doesn't know what to do. I'm pretty sure there's a problem with how I've written the exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ inner_exp choice but I'm not sure what exactly is causing the problem. If I replace that choice with exp ~ ( "+"|"-"|"*"|"/"|"^" ) ~ exp I get an error stating that the expression is left-recursive. How do I fix this grammar?


Solution

  • The choice operator in PEGs is ordered and works as follows: Given e = {alt1 | alt2}:

    Now e = {e1 ~ e2} works as follows:

    So if you have something like e = {(e1 | e2) ~ e3}, the following will happen:

    Notably if e1 succeeds and e3 fails, it does not go back and try to match e2 instead. So if both e1 and e2 can produce a match, but only e2 allows e3 to be matched afterwards, (e1 | e2) ~ e3 will fail whereas (e1 ~ e3) | (e2 ~ e3) would succeed.

    So in your grammar you have (inner_exp | ...) ~ EOI. Now for your input inner_exp produces a match, so per the above rules the other alternatives are never tried and it tries to match EOI next. EOI doesn't match, so the whole rule fails and you get the syntax error you get.

    This explains the syntax error, but it's not the only problem your grammar has:

    Your exp rule is recursive, but it's anchored via SOI and EOI, so it can never match anything other than the entire input. This means that the recursive calls will necessarily always fail. To fix this you should remove SOI and EOI from the definition of exp and instead have a main rule like start = {SOI ~ exp ~ EOI}.

    Once you've done this, you'll get an error that your exp rule is now left-recursive, which pest does not support. To fix that, you can replace the left recursion with repetition like this (replacing both the inner_exp and exp ~ (...) ~ inner_exp alternatives) where operand is a rule that matches the constructs other than infix operations:

    operand ~ (( "+"|"-"|"*"|"/"|"^") ~ operand)*
    

    Incidentally this will also fix your current issue because you now no longer have an inner_exp alternative that's tried before the alternative for infix expressions.

    Your last issue is that you're not taking operator precedence into account at all. You can fix that by introducing additional "levels" of expressions in addition to inner_exp and exp, so that only operators that have the same precedence are defined in the same rule and then each rule invokes the rule containing the next higher precedence to parse the operands. That would look like this:

    exp = { summand ~ (("+" | "-") ~ summand)* }
    summand = { factor ~ (("*" | "/" | "%") ~ factor)* }
    factor = { unary ~ ("^" ~ unary)* }
    unary = { "-" ~ unary | unaryop ~ "(" ~ exp ~ ")" | primary }
    primary = { "(" ~ exp ~ ")" | real | integer | "pi" | id }