javascript parsing nearley

Nearley parser grammar for parsing opening and closing tags


Say I have a simple language to parse in nearley that consists only of strings, for example "this is a string":

string -> "\"" chars "\""

However, a string can contain code within curly braces. To keep things simple, let's say the code can only be another string: "this is a string with {"code"}"

code -> "{" string "}"

How do I define the new string rule in nearley so that it includes the code definition? I keep ending up with a huge number of results, because chars can match one or more characters.

string -> "\"" charCode "\""

charCode -> (chars | code) charCode
| (chars | code)

code -> "{" string "}"

chars -> char chars
| char
char -> [^{}]

Ideally I'd be able to turn something like "chars {"code"} chars chars {"code"} chars" into the array ["chars ", "code", " chars chars ", "code", " chars"].

Perhaps it's only possible to do this using regex and moo, as suggested in the answer to "[Nearley]: how to parse matching opening and closing tag"? (The opening and closing tags are less ambiguous in this example, and I'm not experiencing the same issues.)


Solution

  • I'd certainly use a regex-based lexer. But you could try to write an unambiguous grammar, based on the observation that a charCode never needs two adjacent chars, since two neighbouring runs of characters can always be merged into a single chars (nearley spells the empty alternative null):

    string -> "\"" charCodeStart chars:? "\""
    charCodeStart -> null
                   | charCodeStart chars:? code
    

    Another possibility, using EBNF:

    string -> "\"" ( char:* code ):* char:* "\""
    

    You'll probably have to play with that a bit to get it right. I don't use nearley much.
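
To make the shape of that EBNF rule concrete, here is a rough plain-JavaScript sketch of a hand-written parser that follows it. It is not nearley and not what nearley generates. The function name parseTemplate is made up for illustration, and it assumes that char also excludes the double quote (the question's char -> [^{}] does not), which is what lets it decide each step by looking at one character.

```javascript
// Hand-written parser mirroring:
//   string -> "\"" ( char:* code ):* char:* "\""
//   code   -> "{" string "}"
//   char   -> [^{}"]   (assumption: quotes are excluded as well)
// parseTemplate is a hypothetical name used only for this sketch.
function parseTemplate(input) {
  let pos = 0;

  // Consume exactly one expected character or fail.
  function expect(ch) {
    if (input[pos] !== ch) {
      throw new SyntaxError(`Expected ${JSON.stringify(ch)} at offset ${pos}`);
    }
    pos += 1;
  }

  // string -> "\"" ( char:* code ):* char:* "\""
  function parseString() {
    const parts = [];
    expect('"');
    while (pos < input.length && input[pos] !== '"') {
      if (input[pos] === '{') {
        parts.push(parseCode());
      } else {
        // char:* : take the longest run of plain characters in one step,
        // so a run can never be split into adjacent pieces in several ways.
        const start = pos;
        while (pos < input.length && !'{}"'.includes(input[pos])) pos += 1;
        if (pos === start) {
          // Only a stray "}" can reach this branch.
          throw new SyntaxError(`Unexpected ${JSON.stringify(input[pos])} at offset ${pos}`);
        }
        parts.push(input.slice(start, pos));
      }
    }
    expect('"');
    return parts;
  }

  // code -> "{" string "}" ; the value is the nested string's parts.
  function parseCode() {
    expect('{');
    const inner = parseString();
    expect('}');
    return inner;
  }

  const result = parseString();
  if (pos !== input.length) {
    throw new SyntaxError(`Unexpected trailing input at offset ${pos}`);
  }
  return result;
}
```

Each code part comes back as a nested array, so the example from the question yields ["chars ", ["code"], " chars chars ", ["code"], " chars"]. Calling .flat() on that result gives the flat array asked for, as long as the code is only one level deep.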