parsinglanguage-agnosticerror-recoveryglr

GLR parser with error recovery: too much alternatives when there are errors in input


Preamble

I have written GLR-parser with error recovery. When it encounters an error it splits into following alternatives:

  1. Insert the expected element into input (may be the user just missed it) and proceed as usual.
  2. Replace the erroneous element with expected one (may be the user just made a mistype) and proceed as usual.
  3. Skip erroneous element and if next element is also erroneous then go to #2.

But if input has a lot of errors (for example, user has by mistake given JPEG file to the parser) a number of alternatives grows exponentially.

Example

Such a parser corresponding to the following grammar:

Program -> Identifier WS Identifier WS '=' WS Identifier
Identifier -> ('a'..'z' | 'A'..'Z' | '0'..'9')*
WS -> ' '*

applied to the following text:

x = "abc\"def"; y = "ghi\"jkl";

fails with "out of memory" on moderately modern desktop computer.

Question

How to reduce number of alternatives in case of errors in input?


Solution

  • Doing GLR (parsing therefore) error correction at the character level is possible but aggravates your problem.

    The GLR error recovery procedure we use operates on tokens, so it isn't as bad.

    But when the input has a huge number of errors, it is pretty hard to recover. More sophisticated error recovery schemes basically use the parser to identify valid substrings of the language in the input, and then attempts to patch the substrings together to get the result. That's pretty ambitious.

    I've built GLR parsers with error recovery. I wasn't that ambitious. In general the parser mostly just aborts when the number of live parsers gets above "a large number" (e.g., 10,000) or the number of syntax errors encountered exceed a threshold (e.g, 10 or 20). You might consider aborting the parser if it hasn't advanced the input stream in the last second, which is an indirect sign it has too many live parsers.