javascript parsing nearley

[Nearley]: how to parse matching opening and closing tags


I'm trying to parse a very simple language with nearley: you can put a string between matching opening and closing tags, and you can chain several tags. It looks like a kind of XML, but with [ instead of <, with tags always 2 characters long, and without nesting.

[aa]My text[/aa][ab]Another Text[/ab]

But I can't seem to parse this correctly: as soon as I have more than one tag, I get a complaint that the grammar should be unambiguous.

The grammar that I have right now:

@builtin "string.ne"
@builtin "whitespace.ne"

openAndCloseTag[X] -> "[" $X "]" string  "[/" $X "]"

languages -> openAndCloseTag[[a-zA-Z] [a-zA-Z]] (_ openAndCloseTag[[a-zA-Z] [a-zA-Z]]):*

string -> sstrchar:* {% (d) => d[0].join("") %}

Relatedly, I would ideally like the tags to be case-insensitive (e.g. [bc]TESt[/BC] would be valid).

Does anyone have an idea how to do that? I wasn't able to find a nearley XML parser example.


Solution

  • Your language is almost too simple to need a parser generator. At the same time, it is not context-free, which makes it difficult to use a parser generator. So it is quite possible that the Nearley parser is not the best tool for you, although it is probably possible to make it work with a bit of hackery.

    First things first. You have not actually provided an unambiguous definition of your language, which is why your parser reports an ambiguity. To see the ambiguity, consider the input

    [aa]My text[/ab][ab]Another Text[/aa]
    

    That's very similar to your test input; all I did was swap a pair of letters. Now, here's the question: Is that a valid input consisting of a single aa tag? Or is it a syntax error? (That's a serious question. Some definitions of tagging systems like this consider a tag to only be closed by a matching close tag, so that things which look like different tags are considered to be plain text. Such systems would accept the input as a single tagged value.)

    The problem is that you define string as sstrchar:*, and if we look at the definition of sstrchar in string.ne, we see (leaving out the postprocessing actions, which are irrelevant):

    sstrchar -> [^\\'\n]
        | "\\" strescape
        | "\\'"
    

    Now, the first possibility is "any character other than a backslash, a single quote or a newline", and it's easy to see that all of the characters in [/ab] are in sstrchar. (It's not clear to me why you chose sstrchar; single quotes don't appear to be special in your language. Or perhaps you just didn't mention their significance.) So a string could extend up to the end of the input. Of course, the syntax requires a closing tag, and the Nearley parser is determined to find a match if there is one. But, in fact, there are two of them. So the parser declares an ambiguity, since it doesn't have any criterion to choose between the two close tags.
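
    The two candidate close tags can be seen outside Nearley with a rough regular-expression analogy. This is only a sketch: the names `input`, `lazy` and `greedy` are mine, and `.` stands in loosely for sstrchar. The lazy and greedy versions of the same pattern pick different close tags, which is exactly the choice the parser has no rule for.

    ```javascript
    const input = "[aa]My text[/aa][ab]Another Text[/ab]";

    // Same shape as the expanded grammar rule: any two letters open, any two letters close.
    const lazy = /^\[[a-zA-Z]{2}\](.*?)\[\/[a-zA-Z]{2}\]/;   // stops at the first close tag
    const greedy = /^\[[a-zA-Z]{2}\](.*)\[\/[a-zA-Z]{2}\]/;  // runs on to the last close tag

    console.log(input.match(lazy)[1]);   // "My text"
    console.log(input.match(greedy)[1]); // "My text[/aa][ab]Another Text"
    ```

    Both results are legitimate readings under the grammar as written, so the input has more than one parse.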

    And here's where we come up against the issue that your language is not context-free. (Actually, it is context-free in some technical sense, because there are "only" 676 two-letter case-insensitive tags, and it would theoretically be possible to list all 676 possibilities. But I'm guessing you don't want to do that.)

    A context-free grammar cannot express a language that insists that two non-terminals expand to the same string. That's the very definition of context-free: if one non-terminal can only match the same input as a previous non-terminal, then the second non-terminal's match is dependent on the context, specifically on the match produced by the first non-terminal. In a context-free grammar, a non-terminal expands to the same thing, regardless of the rest of the text. The context in which the non-terminal appears is not allowed to influence the expansion.

    Now, you quite possibly expected that your macro definition:

    openAndCloseTag[X] -> "[" $X "]" string  "[/" $X "]"
    

    is expressing a context-sensitive match by repeating the $X macro parameter. But it is not by accident that the Nearley documentation describes this construct as a macro. X here refers exactly to the string used in the macro invocation. So when you say:

    openAndCloseTag[[a-zA-Z] [a-zA-Z]]
    

    Nearley macro-expands that to

     "[" [a-zA-Z] [a-zA-Z] "]" string  "[/" [a-zA-Z] [a-zA-Z] "]"
    

    and that's what it will use as the grammar production. Observe that the two $X macro parameters were expanded to the same argument, but that doesn't mean that they will match the same input text. Each of those subpatterns will independently match any two alphabetic characters. Context-freely.
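
    The difference between "same pattern" and "same text" can be illustrated with plain JavaScript regular expressions. This is an analogy rather than Nearley code, and the names `expanded` and `tied` are hypothetical: the first pattern mirrors the expanded production, the second uses a back-reference to tie the close tag to the open tag.

    ```javascript
    // What the expanded production amounts to: two independent [a-zA-Z][a-zA-Z] slots.
    const expanded = /^\[[a-zA-Z][a-zA-Z]\].*\[\/[a-zA-Z][a-zA-Z]\]$/;

    // A back-reference (\1) is what actually forces the close tag to repeat the open tag.
    const tied = /^\[([a-zA-Z]{2})\].*\[\/\1\]$/i;

    console.log(expanded.test("[aa]text[/zz]")); // true
    console.log(tied.test("[aa]text[/zz]"));     // false
    console.log(tied.test("[bc]TESt[/BC]"));     // true
    ```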

    As I alluded to earlier, you could use this macro to write out the 676 possible tag patterns:

    tag -> openAndCloseTag["aa"i]
         | openAndCloseTag["ab"i]
         | openAndCloseTag["ac"i]
         | ...
         | openAndCloseTag["zz"i]
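
    If you really wanted that list, you would probably generate it rather than type it. A minimal sketch, assuming the grammar text is assembled by a small Node script (the variable names are mine, and the output would still have to be pasted into or written to the .ne file):

    ```javascript
    const letters = "abcdefghijklmnopqrstuvwxyz";
    const alternatives = [];

    // One macro invocation per two-letter tag, using the case-insensitive "xx"i literal.
    for (const first of letters) {
      for (const second of letters) {
        alternatives.push(`openAndCloseTag["${first}${second}"i]`);
      }
    }

    const rule = "tag -> " + alternatives.join("\n     | ");
    console.log(alternatives.length); // 676
    console.log(rule);
    ```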
    

    If you did that (and you managed to correctly list all of the possibilities) then the parser would not complain about ambiguity as long as you never use the same tag twice in the same input. So it would be ok with both your original input and my altered input (as long as you accept the interpretation that my input is a single tagged object). But it would still report the following as ambiguous:

    [aa]My text[/aa][aa]Another Text[/aa]
    

    That's ambiguous because the grammar allows it to be either a single aa tagged string (whose text includes characters which look like close and open tags) or two consecutive aa tagged strings.

    To eliminate the ambiguity you would have to write the string pattern in a way which does not permit internal tags, in the same way that sstrchar doesn't allow internal single quotes. Except, of course, it is not nearly as simple to match a string which doesn't contain a pattern as it is to match a string which doesn't contain a single character. It could be done using Nearley, but I really don't think that it's what you want.
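
    One way to picture such a "no internal tags" string, outside Nearley, is a regular expression with a negative lookahead. This is only a sketch: it assumes the stricter reading in which anything shaped like a two-letter open or close tag is forbidden inside the text, and the names `strictTag` and `first` are mine.

    ```javascript
    // The text is a run of characters, none of which starts something shaped like [xx] or [/xx].
    const strictTag = /^\[([a-zA-Z]{2})\]((?:(?!\[\/?[a-zA-Z]{2}\]).)*)\[\/\1\]/i;

    const first = "[aa]My text[/aa][aa]Another Text[/aa]".match(strictTag);
    console.log(first[0]); // "[aa]My text[/aa]"
    console.log(first[2]); // "My text"

    // Under this reading, a different tag inside the text is an error, not plain text.
    console.log(strictTag.test("[aa]My text[/ab][ab]Another Text[/aa]")); // false
    ```

    With the text restricted this way there is only one place the first tagged string can end, so the two-readings problem goes away.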

    Probably your best bet is to use native Javascript regular expressions to match tagged strings. This will prove simpler because Javascript regular expressions are much more powerful than mathematical regular expressions, even allowing the possibility of matching (certain) context-sensitive constructions. You could, for example, use Javascript regular expressions with the Moo lexer, which integrates well into Nearley. Or you could just use the regular expressions directly, since once you match the tagged text, there isn't much else you need to do.

    To get you started, here's a simple Javascript regular expression which matches tagged strings with matching case-insensitive labels (the i flag at the end):

    /\[([a-zA-Z]{2})\].*?\[\/\1\]/gmi
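
    As a rough illustration of using the regular expression directly, here is a sketch of a small splitter. The function name `parseTagged` and the result shape are hypothetical, and I have adapted the pattern slightly: a capture group around the text, optional leading whitespace, and the sticky flag (y) so that each match has to start exactly where the previous one ended.

    ```javascript
    // Hypothetical helper: splits the input into { tag, text } records,
    // and throws if anything is left over that is not a tagged string.
    function parseTagged(input) {
      const pattern = /\s*\[([a-zA-Z]{2})\](.*?)\[\/\1\]/iy;
      const result = [];
      let end = 0;
      let m;
      while ((m = pattern.exec(input)) !== null) {
        result.push({ tag: m[1].toLowerCase(), text: m[2] });
        end = pattern.lastIndex; // remember where the last successful match stopped
      }
      if (input.slice(end).trim() !== "") {
        throw new SyntaxError(`Unexpected input at offset ${end}`);
      }
      return result;
    }

    console.log(parseTagged("[aa]My text[/aa][ab]Another Text[/ab]"));
    console.log(parseTagged("[bc]TESt[/BC]"));
    ```

    Because the match is lazy and tied to the opening tag by the back-reference, this takes the "different tags inside are plain text" interpretation: `[aa]My text[/ab][ab]Another Text[/aa]` comes back as a single aa record.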
    

    You can play with it online using Regex 101.