antlr xtext error-recovery

How to fix? Xtext grammar stops parsing with 'no viable alternative at input ...' on incorrect input


As an Xtext and Antlr newbie, I'm struggling to get an error-tolerant Xtext grammar for a very simple subset of a (not JVM-related) language I want to parse.

A document in this mini-language could look like this:

$c wff |- $.
$c class $.
$c set $.

So a document is a sequence of statements, each surrounded by the $c and $. keywords, with one or more words in between that may not contain $. All tokens are separated by mandatory whitespace.

The best I can come up with is the following grammar:

grammar mm.ecxt.MMLanguage

import "http://www.eclipse.org/emf/2002/Ecore" as ecore

generate mmLanguage "urn:marnix:mm.exct/MMLanguage"

MMDatabase:
    WS? (statements+=statement WS)* statements+=statement WS?;

statement:
    DOLLAR_C WS (symbols+=MATHSYMBOL WS)+ DOLLAR_DOT;

terminal DOLLAR_C: '$c';
terminal DOLLAR_DOT: '$.';
terminal MATHSYMBOL: 
      ('!'..'#'|'%'..'~')+; /* everything except '$' */

terminal WS : (' '|'\t'|'\r'|'\n')+;

terminal WORD: ('!'..'~')+;

On valid input this grammar works fine. However, on invalid input, like

$c class $.
$c $.
$c set $.
$c x$u $.

there is just one error (no viable alternative at input '$.'), and after that it looks like parsing just stops: no more errors are detected, and the model just contains the correct statements before the error (here only the class statement).

I tried all kinds of variations (using =>, with/without terminal declarations, enabling backtracking, and more) but all I get is no viable alternative at input ....

So my question is: How should I write a grammar for this language so that Antlr does some form of error recovery? Or is there something else that I'm doing wrong?

From, e.g., http://zarnekow.blogspot.de/2012/11/xtext-corner-7-parser-error-recovery.html I expected that this would work out of the box. Or is this because I'm not using a Java/C-like grammar based on Xbase?


Solution

  • What seems to happen here is that in line 2 of your sample input, two tokens are missing according to your grammar: the parser expects (symbols+=MATHSYMBOL WS)+ but gets $. instead. Antlr will happily try to recover using different strategies; some work locally, others work on a per-parser-rule basis. However, Antlr will not insert two recovery tokens to finish the rule statement, so it bails out from there. After the statement, a mandatory WS is expected, but the parser sees $. and bails out again. That is why it appears not to recover at all. All of this is more or less an educated guess, though.

    What will help, though, is a minor refactoring that makes the grammar less strict than it currently is. A few optional tokens help the parser to recover:

    MMDatabase:
        WS? (statements+=statement WS?)*;
    
    statement:
        DOLLAR_C WS (symbols+=MATHSYMBOL WS?)* DOLLAR_DOT;
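
    For reference, here is a sketch of how the whole grammar file might look with these two rules swapped in. The header and the terminal rules are copied unchanged from the question; only MMDatabase and statement differ. This is untested, so treat it as a starting point rather than a verified fix:

    grammar mm.ecxt.MMLanguage

    import "http://www.eclipse.org/emf/2002/Ecore" as ecore

    generate mmLanguage "urn:marnix:mm.exct/MMLanguage"

    // Relaxed: whitespace between statements is optional, and zero statements are allowed.
    MMDatabase:
        WS? (statements+=statement WS?)*;

    // Relaxed: zero or more symbols, and whitespace after a symbol is optional.
    statement:
        DOLLAR_C WS (symbols+=MATHSYMBOL WS?)* DOLLAR_DOT;

    // Terminals as in the question.
    terminal DOLLAR_C: '$c';
    terminal DOLLAR_DOT: '$.';
    terminal MATHSYMBOL:
          ('!'..'#'|'%'..'~')+; /* everything except '$' */

    terminal WS : (' '|'\t'|'\r'|'\n')+;

    terminal WORD: ('!'..'~')+;

    Note that the relaxed statement rule accepts $c $. (line 2 of the sample input) as a statement with no symbols, so the parser no longer has to recover at that point. If empty statements should still be reported as errors, that would have to happen in a separate check after parsing.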