parsing error-reporting peg parsing-error

Grammar: start: (a b)? a c; Input: a d. Which error correct at position 2? 1. expected "b", "c". OR expected "c"

Grammar:

rule: (a b)? a c ;

Input:

a d

Question: Which error message correct at position 2 for given input?

1. expected "b", "c".
2. expected "c".

P.S.

I write parser and I have choice (dilemma) take into account that "b" expected at position or not take.

The #1 error (expected "b", "c") want to say that input "a b" expected but because it optional it may not expected but possible.

I don't know possible is the same as expected or not?

Which error message better and correct #1 or #2?

Thanks for answers.

P.S.

In first case I define marker of testing as limit of position.

if(_inputPos > testing) {
  _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}

Limit moved in optional expressions:

OPTIONAL_EXPRESSION:
testing = _inputPos;

The "b" expression move _inputPos above the testing pos and add failure at _inputPos.

In second case I can define marker of testing as boolean flag.

if(!testing) {
  _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}

The "b" expression in this case not add failure because it tested (inner for optional expression).

What you think what is better and correct?

Testing defined as specific position and if expression above this position (_inputPos > testing) it add failure (even it inside optional expression).
Testing defined as flag and if this flag set that the failures not takes into account. After executing optional expression it restore (not reset!) previous value of testing (true or false).

Also failures not takes into account if rule not fails. They only reported if parsing fails.

P.S.

Changes at 06 Jan 2014

This question raised because it related to two different problems.

First problem:

Parsing expression grammar (PEG) describe only three atomic items of input:

terminal symbol
nonterminal symbol
empty string

This grammar does not provide such operation as lexical preprocessing an thus it does not provide such element as the token.

Second problem:

What is a grammar? Are two grammars can be considred equal if they accept the same input but produce different result?

Assume we have two grammar:

Grammar 1

rule <- type? identifier

Grammar 2

rule <- type identifier / identifier

They both accept the same input but produce (in PEG) different result.

Grammar 1 results:

{type : type, identifier : identifier}
{type : null, identifier : identifier}

Grammar 2 results:

{type : type, identifier : identifier}
{identifier : identifier}

Quetions:

Both grammar equal?
It is painless to do optimization of grammars?

My answer on both questions is negative. No equal, Not painless.

But you may ask. "But why this happens?".

I can answer to you. "Because this is not a problem. This is a feature".

In PEG parser expression ALWAYS consists from these parts.

ORDERED_CHOICE => SEQUENCE => EXPRESSION

And this explanation is the my answer on question "But why this happens?".

Another problem.

PEG parser does not recognize WHITESPACES because it does not have tokens and tokens separators.

Now look at this grammar (in short):

program <- WHITESPACE expr EOF

expr <- ruleX

ruleX <- 'X' WHITESPACE

WHITESPACE < ' '?

EOF <- ! .

All PEG grammar desribed in this manner.

First WHITESPACE at begin and other WHITESPACE (often) at the end of rule.

In this case in PEG optional WHITESPACE must be assumed as expected.

But WHITESPACE not means only space. It may be more complex [\t\n\r] and even comments.

But the main rule of error messages is the following.

If not possible to display all expected elements (or not possible to display at least one from all set of expected elements) in this case is more correct do not display anything.

More precisely required to display "unexpected" error mesage.

How you in PEG will display expected WHITESPACE?

Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n' , 'r'

What about start charcters of comments? They also may be part of WHITESPACE in some grammars.

In this case optional WHITESPACE will be reject all other potential expected elements because not possible correctly to display WHITESPACE in error message because WHITESPACE is too complex to display.

Is this good or bad?

I think this is not bad and required some tricks to hide this nature of PEG parsers.

And in my PEG parser I not assume that the inner expression at first position of optional (optional & zero_or_more) expression must be treated as expected. But all other inner (except at the first position) must treated as expected.

Example 1:

List<int list; // type? ident

Here "List<int" is a "type". But missing ">" is not at the first position in optional "type?".

This failure take into account and report as "expected '>'" This is because we not skip "type" but enter into "type" and after really optional "List" we move position from first to next real "expected" (that already outside of testing position) element.

"List" was in "testing" position.

If inner expression (inside optional expression) "fits in the limitation" not continue at next position then it not assumed as the expected input.

From this assumption has been asked main question.

You must just take into account that we are talking about PEG parsers and their error messages.

Solution

Here is your grammar:

What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.

The basic idea to produce an error message for an invalid input is to find the most far place you failed (if your grammar where d | (a b)? a c, d wouldn't be part of the error) and determine what are all possible inputs that could make you advance and say "expected '...' but got '...'". There are other approaches to try to recover the parser and force it to continue. If there is only one possible expected token, let's temporarily insert it into the token stream and continue as if it where there since ever. This would lead to better error detection as you can find errors beyond the point where the parser first stopped.