bnfearley-parsernearley

How can I write an unambiguous nearley grammar for boolean search operators


The Context

I am climbing the Nearley learning curve and trying to write a grammar for a search query parser.

The Goal

I would like to write grammar that is able to parse a querystring that contains boolean operators (e.g. AND, OR, NOT). Lets use AND for this question as a trivial case.

For instance, the grammar should recognize these example strings as valid:

The Attempt

My naive attempt looks something like this:

query -> 
    statement
  | statement "AND" statement

statement -> .:+

The Problem

The above grammar attempt is ambiguous because .:+ will match literally any string. What I really want is for the first condition to match any string that does not contain AND in it. Once "AND" appears I want to enter the second condition only.

The Question

How can I detect these two distinct cases without having ambiguous grammar?

I am worried I'm missing something fundamental; I can imagine a ton of use cases where we want arbitrary text split up by known operators.


Solution

  • Yeah, if you've got an escape hatch that could be literally anything, you're going to have a problem.

    Somewhere you're going to want to define what your base set of tokens are, at least something like \S+ and then how those tokens can be composed.

    The place I'd typically start for a parser is trying to figure out where recursion is accounted for in the parser, and what approach to parsing the lib you're relying on takes.

    Looks like Nearley is an Earley parser, and as the wikipedia entry for them notes, they're efficient for left-recursion.

    This is just hazarding a guess, but something like this might get you to conjunction at least.

    CONJUNCTION -> AND | OR
    STATEMENT -> TOKENS | (TOKENS CONJUNCTION STATEMENT)
    TOKENS -> [^()]+
    

    A structure like this should be unambiguous and bans parentheses in tokens, unless they're surrounded by double quotes.