As an Xtext and Antlr newbie, I'm struggling with getting an error-tolerant Xtext grammar for a very simple subset of a (not JVM related) language I want to parse.
A document in this mini-language could look like this:
$c wff |- $.
$c class $.
$c set $.
So a document is a sequence of statements, each delimited by the keywords '$c' and '$.', with one or more words in between that may not contain '$'. Everything is separated by mandatory whitespace.
The best I can come up with is the following grammar:
grammar mm.ecxt.MMLanguage
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate mmLanguage "urn:marnix:mm.exct/MMLanguage"
MMDatabase:
    WS? (statements+=statement WS)* statements+=statement WS?;

statement:
    DOLLAR_C WS (symbols+=MATHSYMBOL WS)+ DOLLAR_DOT;

terminal DOLLAR_C: '$c';
terminal DOLLAR_DOT: '$.';
terminal MATHSYMBOL:
    ('!'..'#'|'%'..'~')+; /* everything except '$' */
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal WORD: ('!'..'~')+;
On valid input this grammar works fine. However, on invalid input, like
$c class $.
$c $.
$c set $.
$c x$u $.
there is just one error (no viable alternative at input '$.'), and after that it looks like parsing just stops: no more errors are detected, and the model just contains the correct statements before the error (here only the class statement).
I tried all kinds of variations (using =>, with/without terminal declarations, enabling backtracking, and more) but all I get is "no viable alternative at input ...".
So my question is: How should I write a grammar for this language so that Antlr does some form of error recovery? Or is there something else that I'm doing wrong?
From, e.g., http://zarnekow.blogspot.de/2012/11/xtext-corner-7-parser-error-recovery.html I expected that this would work out of the box. Or is this because I'm not using a Java/C-like grammar based on Xbase?
What seems to happen here is that in line 2 of your sample input, two tokens are missing according to your grammar: the parser expects a (symbols+=MATHSYMBOL WS)+ but gets '$.'. Antlr will happily try to recover with different strategies; some work locally and others work on a per-parser-rule basis. Antlr will not insert two recovery tokens to finish the rule statement, so it bails out from there. After the statement, a mandatory WS is expected, but the parser sees '$.', so it bails out again. That's why it appears not to recover at all.
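To check the "two tokens are missing" part, the terminal rules from your grammar can be mimicked outside Xtext. This is only a rough Python sketch: the names TERMINALS, tokenize and token_types are made up for illustration, and it assumes the lexer takes the longest match and prefers the earlier-declared terminal on a tie, which simplifies what the generated Antlr lexer really does.

```python
import re

# Terminal rules from the question, in declaration order.
TERMINALS = [
    ("DOLLAR_C", re.compile(r"\$c")),
    ("DOLLAR_DOT", re.compile(r"\$\.")),
    ("MATHSYMBOL", re.compile(r"[!-#%-~]+")),  # printable ASCII except '$'
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("WORD", re.compile(r"[!-~]+")),
]


def tokenize(text):
    """Return (terminal name, matched text) pairs for the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in TERMINALS:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier-declared rule on a tie (assumption).
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no terminal matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


def token_types(text):
    """Only the terminal names, which is what the parser rules look at."""
    return [name for name, _ in tokenize(text)]


print(token_types("$c $."))      # ['DOLLAR_C', 'WS', 'DOLLAR_DOT']
print(token_types("$c x$u $."))  # ['DOLLAR_C', 'WS', 'WORD', 'WS', 'DOLLAR_DOT']
```

Under those assumptions, '$c $.' reaches the parser without the MATHSYMBOL and the WS that the statement rule requires before DOLLAR_DOT, and 'x$u' arrives as a WORD token, which no parser rule uses.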
Well, all of this is more or less an educated guess.
What will help though is a minor grammar refactoring where you do not make the grammar as strict as it currently is. Some optional tokens will help the parser to recover:
MMDatabase:
    WS? (statements+=statement WS?)*;

statement:
    DOLLAR_C WS (symbols+=MATHSYMBOL WS?)* DOLLAR_DOT;
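One consequence of the relaxed rules is worth keeping in mind: since the symbol loop is now (...)*, a statement without any symbols is no longer a syntax error. A rough way to see this is to transcribe both versions of the parser rules as regular expressions over token types. This is a Python sketch, not Xtext or Antlr code: the one-letter codes and the accepts helper are made up, it says nothing about how Antlr recovers from errors, and it assumes 'x$u' is lexed as a single WORD token.

```python
import re

# One letter per terminal, so a token stream becomes a plain string.
CODES = {"DOLLAR_C": "C", "DOLLAR_DOT": "D", "MATHSYMBOL": "M", "WS": "W", "WORD": "X"}

# Parser rules from the question (strict) and from this answer (relaxed),
# transcribed as regular expressions over those letters.
STRICT_STATEMENT = r"CW(?:MW)+D"
STRICT_DATABASE = rf"W?(?:{STRICT_STATEMENT}W)*{STRICT_STATEMENT}W?"
RELAXED_STATEMENT = r"CW(?:MW?)*D"
RELAXED_DATABASE = rf"W?(?:{RELAXED_STATEMENT}W?)*"


def accepts(database_pattern, token_types):
    """True if the whole token stream is derivable from the given rules."""
    encoded = "".join(CODES[t] for t in token_types)
    return re.fullmatch(database_pattern, encoded) is not None


VALID = ["DOLLAR_C", "WS", "MATHSYMBOL", "WS", "DOLLAR_DOT"]   # $c set $.
EMPTY_STATEMENT = ["DOLLAR_C", "WS", "DOLLAR_DOT"]             # $c $.
BAD_SYMBOL = ["DOLLAR_C", "WS", "WORD", "WS", "DOLLAR_DOT"]    # $c x$u $. (assumed lexing)

print(accepts(STRICT_DATABASE, EMPTY_STATEMENT))   # False
print(accepts(RELAXED_DATABASE, EMPTY_STATEMENT))  # True
print(accepts(RELAXED_DATABASE, BAD_SYMBOL))       # False
```

So with the relaxed grammar '$c $.' parses, and a rule such as "at least one symbol" would have to be checked after parsing if you still want it reported, while '$c x$u $.' remains a syntax error that the parser has to recover from.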