parsingocamlgrammarmenhir

Parser: Distinguish parenthesis and function declaration in JavaScript like grammar


I'm working on a compiler using OCaml and Menhir as parser and lexer. When I'm writing a JavaScript like Grammar, with (a, b) => a + b like lambda function definition and also arithmetics (a + b) * c with parenthesis prioritizing sub-expressions, I write

expr
: x = ID { EId(x) }
...
| LPAREN; e = expr; RPAREN; { e }
...
| LPAREN; args = separated_list(COMMA, ID); RPAREN; ARROW; body = expr; { EFunction(args, body) }

in parser.mly.

Just adding few more context, I have my lexer.mll like this:

let letter = ['a'-'z' 'A'-'Z']
let lud = ['a'-'z' 'A'-'Z' '_' '0'-'9']
let id = letter lud*

rule read = parse
    ...
    | "(" { LPAREN }
    | ")" { RPAREN }
    | "," { COMMA }
    | "=>" { ARROW }
    ...
    | id { ID (Lexing.lexeme lexbuf) }
    ...

But this is going to get a reduce/reduce error when compile:

Error: do not know how to resolve a reduce/reduce conflict between the following two productions:
expr -> ID
separated_nonempty_list(COMMA,ID) -> ID

I know that the problem is probably caused by the ambiguity between these two: (a) and (a) => a (one-arg-function). But I still couldn't understand why it is still producing this error since to me it is super clear that the parenthesis followed by a fat arrow => is going to be a function...

Can someone help me a bit on this?


Solution

  • But I still couldn't understand why it is still producing this error since to me it is super clear that the parenthesis followed by a fat arrow => is going to be a function...

    Yes, it is super clear. The grammar is totally unambiguous. But you are not limited to looking at the input one token at a time, whereas an LR(1) parser is. At the moment when the parser is trying to decide what to do about the a in (a), it cannot yet seen the fat arrow and it has to make its decision before it does. That is, before consuming the ), the parser needs to decide whether what comes before it is an expr or a separated_nonempty_list.

    It's possibly worth noting that the grammar is actually LR(2): one more lookahead token, and the conflict could be resolved. That's not much consolation since Menhir does not provide any mechanism for increased lookahead, but it does mean that a solution exists, because the existence of an LR(k) grammar for a language implies the existence of an LR(1) grammar for the same language; there's even a mechanical algorithm to produce the LR(1) grammar.

    Rather than transforming the entire grammar, an only slightly messy solution is to isolate the '(a)` cases, which can be done with a pair of apparently redundant rules:

    expr: LPAREN ID RPAREN ARROW expr
        | LPAREN ID RPAREN
    

    The second production obviously conflicts with LPAREN expr RPAREN, but it's a shift/reduce conflict rather than a reduce/reduce conflict and it can be resolved by giving RPAREN a higher precedence than ID in order to force resolution in favour of shifting RPAREN.

    That's a complete perversion of precedence declarations, and it may well prove to be problematic as your grammar gets more complicated. A more theoretically sound solution would be to define expr_other_than_identifier. You can find an example of that here, which is a very similar grammar, and in some other SO answers to similar questions.

    In particular, Yacc's own grammar is LR(2) (you can't tell that a rule has ended until you see the : following the non-terminal which starts the next rule). A similar solution exists for this grammar, but most yacc-like parser generators solve the problem by pushing the extra lookahead into the lexer analyzer. For example, you could recognize ) => as a single token (including the internal whitespace), or you could issue a different close parenthesis token if the next token were a fat arrow.