parsinggrammarebnfnearley

Generating a parser for expressions


I am trying to generate a javascript parser for the ebnf grammar described in this Microsoft article. The ebnf specified in the article does not work when I use it as its written, so I have tried to simplify it to make it work with the REx parser generator.

The goal is in Javascript to be able to parse and evaluate expressions like these to True or False:

I am using the REx parser generator online here: https://www.bottlecaps.de/rex/. If you have suggestions for other parser generators that produce JavaScript I would appreciate some links to where I can find them.

The issue I'm struggling with is the definition of the MethodCall. I have tried a lot of different definitions but all fail. When I remove the MethodCall and MethodArgs definition the REx parser generator produces a parser.

So I would appreciate any help to crack this problem a lot.

Below is the grammar as far as I have been able to get.

Expression
    ::= BinaryExpression | MethodCall | "(" Expression ")" | Number
BinaryExpression
    ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
RelationalExpression
    ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
AdditiveExpression
    ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
MultiplicativeExpression
    ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
UnaryExpression
    ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier

MethodCall
    ::= Identifier "(" MethodArgs* ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs

Identifier
    ::= Letter ( Letter | Digit | "_" )*
Number
    ::= Digit ('.' Digit) | (Digit)*

<?TOKENS?>

Letter
    ::= [A-Za-z]
Digit
    ::= [0-9]

Here are some of the different versions for the MethodCall definition I have tried but all of them fail.

MethodCall
    ::= Identifier "(" MethodArgs? ")"
MethodArgs
    ::= Expression ("," MethodArgs)*

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= ( Expression ("," MethodArgs)* )?

MethodCall
    ::= MethodName "(" MethodArgs? ")"
MethodName
    ::= Identifier
MethodArgs
    ::= Expression ("," MethodArgs)*

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression ("," MethodArgs)* | ""

MethodCall
    ::= Identifier MethodArgs
MethodArgs
    ::= "(" (Expression ("," MethodArgs)* | "")  ")"

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs | ""

MethodCall
    ::= Identifier "(" Expression ")"

I have tried to get inspiration from a number of other languages to see how they do it, but with no luck yet so far:

Update

Just wanted to let you know how this turned out. I struggled to get the EBNF grammar to do what I needed it to do, so I took a look at Nearley like @rici suggested and converted my grammar into the Nearley grammar syntax and got the tooling to work. It was just a much better choice for my project. The documentation is very good, the tooling is also great and the error messages is very helpful. So a huge thanks to @rici for suggesting Nearley.

Below is the grammar I have implemented. I have tested with the following inputs:

'2 + 4', '2 + 4 - 6', '(2 + 4)', '!true', '!(true)', 'hasCategory(test)', 'hasCategory(test,test2)', 'hasCategory( test , test2 )', 'hasCategory(test,test2, test3)', 'IsReference', 'IsReference()', '2 * 4', '(2 / 4)', '2 * 4 + 2', '(2 / 4) + 2', '2 > 4', '2 >= 2', '2 = 4', '2 == 2', '2 != 4', '2 !== 2', '(2 * 4 + 2) > 4', '(2 * 4 + 2) > (4 + 10)', 'true', 'true or false', 'true || false', 'true and false', 'true && false', '(true or false)', '!(true or false)', '2 != 1+1', '2 != (1+1)', '2 != (1+2)', '(2 > 2 or (2 != 1+1))',

@builtin "whitespace.ne" # `_` means arbitrary amount of whitespace
@builtin "number.ne"     # `int`, `decimal`, and `percentage` number primitives
@builtin "postprocessors.ne"

@{%
function methodCall(nameId, argsId = -1) {
  return function(data) {
      return {
          type: 'methodCall',
          name: data[nameId],
          args: argsId == -1 ? [] : data[argsId]
      };
    }
}

function value() {
  return function(data) {
      return {
          type: 'value',
          value: data[0]
      };
    }
}
%}
expression -> 
    methodCall {% id %}
  | relationalExpression {% value() %}
  | booleanExpression  {% value() %}
  | _ identifier _  {% methodCall(1) %}

booleanExpression ->
      parentheses {% id %}
    | parentheses _ "and"i _ parentheses {% d => d[0] && d[4] %}
    | parentheses _ "&&" _ parentheses {% d => d[0] && d[4] %}
    | parentheses _ "or"i _ parentheses {% d => d[0] || d[4] %}
    | parentheses _ "||" _ parentheses {% d => d[0] || d[4] %}

parentheses ->
    _ "(" relationalExpression ")" _ {% nth(2) %}
  |  _ "(" booleanExpression ")" _ {% nth(2) %}
  | unaryExpression {% id %}

relationalExpression -> 
      _ additiveExpression _ {% nth(1) %}
    | relationalExpression _ "=" _ additiveExpression {% d => d[0] == d[4] %}
    | relationalExpression _ "==" _ additiveExpression {% d => d[0] == d[4] %}
    | relationalExpression _ "!=" _ additiveExpression {% d => d[0] != d[4] %}
    | relationalExpression _ "!==" _ additiveExpression {% d => d[0] != d[4] %}
    | relationalExpression _ "<" _ additiveExpression {% d => d[0] < d[4] %}
    | relationalExpression _ ">" _ additiveExpression {% d => d[0] > d[4] %}
    | relationalExpression _ "<=" _ additiveExpression {% d => d[0] <= d[4] %}
    | relationalExpression _ ">=" _ additiveExpression {% d => d[0] >= d[4] %}

additiveExpression -> 
    _ multiplicativeExpression _ {% nth(1) %}
  | additiveExpression _ "+" _ multiplicativeExpression {% d => d[0] + d[4] %}
  | additiveExpression _ "-" _ multiplicativeExpression {% d => d[0] - d[4] %}

multiplicativeExpression ->
    _ parentheses _  {% nth(1) %}
  | parentheses _ "*" _ parentheses {% d => d[0] * d[4] %}
  | parentheses _ "/" _ parentheses {% d => d[0] / d[4] %}
  | parentheses _ "%" _ parentheses {% d => d[0] % d[4] %}

unaryExpression ->  
    _ "!" _ expression _ {% d => !d[3] %}
  | _ decimal _ {% nth(1) %}
  | _ unsigned_int _ {% nth(1) %}
  | _ boolean _ {% nth(1) %}
  | _ identifier _ {% nth(1) %}

methodCall -> 
      identifier "(" methodArgs ")" {% methodCall(0, 2) %}
    | identifier "(" _ ")" {% methodCall(0) %}

methodArgs ->
    _ identifier _  {% d => [d[1]] %}
  | _ identifier _ "," _ methodArgs _ {% d => [d[1]].concat(d[5]) %}

boolean ->
    "true"i {% () => true %} 
  | "false"i {% () => false %}

identifier -> 
  [A-Za-z0-9_]:+ {% (data, l, reject) => {
    var ident = data[0].join('');
    if (ident.toLowerCase() === 'true' || ident.toLowerCase() === 'false') {
      return reject;
    } else {
      return ident;
    }
  }
   %}

Solution

  • There are a few problems with your grammar, but it's mostly fine.

    Parser generators have a slightly different syntax for "EBNF", which is what REx takes. REx adds a <?TOKENS?> string to denote the boundary between parser and lexer rules. Microsoft says the grammar is "BNF" but it's not because it uses the Kleene operator <Identifier> ::= [^. ]*, an EBNF construct. It also fudges the definition of <Literal> and <Number> with prose.

    I haven't tested the generated parser, but it seems like a straightforward recursive descent implementation. The parser generators that I am familiar with, and that are popular, are listed in the conversion page. (I'm writing converters for all of them and many more.)

    Try this:

    Input ::= Expression EOF
    
    Expression
        ::= BinaryExpression | MethodCall | "(" Expression ")" | Number
    
    BinaryExpression
        ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
    RelationalExpression
        ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
    AdditiveExpression
        ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
    MultiplicativeExpression
        ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
    UnaryExpression
        ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier
    
    MethodCall
        ::= Identifier "(" MethodArgs* ")"
    MethodArgs
        ::= Expression | Expression "," MethodArgs
    
    <?TOKENS?>
    
    Identifier
        ::= ( ( Letter ( Letter | Digit | "_" )* ) - ( 'and' | 'or' ) )
    Number ::= Digit ( ('.' Digit) | (Digit)* )
    
    Letter
        ::= [A-Za-z]
    Digit
        ::= [0-9]
    
    EOF      ::= $