treesitter

Tree sitter looking ahead, failing, but not reversing and trying another path


I'm writing a Tree-sitter grammar. It parses the code below along one path until it hits an error, but instead of backtracking and trying another path, it reports an error and moves on to the next section. Here is the problem code:

match x with
    1 -> 1

If I have the following relevant grammar:

match: $ => 'match',
with: $ => 'with',
wordy_id: $ => 'x',
nat: $ => '1',
arrow_symbol: $ => '->',
_pattern_lhs: $ => choice($.nat),
_block: $ => ...

pattern: $ => choice(
  '1 -> 1'
)

Note that _block denotes a layout_start (an indent), followed by some repeated rule with a layout_semicolon (a literal ; or a newline) between repetitions, and then a layout_end (a dedent). The indent and dedent are detected by an external C scanner.

The grammar above parses the code as

(match) (wordy_id) (with)
  (pattern)

If I update pattern to

pattern: $ => choice(
  '1 -> 1',
  seq($._pattern_lhs, $.arrow_symbol, $.nat)
)

Then it gets evaluated successfully to

(match) (wordy_id) (with)
(pattern (nat) (arrow_symbol) (nat))

Excellent so far!

But now if I intentionally sabotage one of the choices:

pattern: $ => choice(
  '1 -> 1',
  seq($._pattern_lhs, $.arrow_symbol, 'howdydoody', $._block),
)

Now the seq option should fail, but the '1 -> 1' literal should still match.

However the parsing just fails:

(match) (wordy_id) (with) (nat) (arrow_symbol) (ERROR [r1,c1] - [r2,c2])

If I move the sabotage string to the front of the seq:

pattern: $ => choice(
  '1 -> 1',
  seq('lol', $._pattern_lhs, $.arrow_symbol, $._block),
)

Then it does backtrack because it immediately sees 'lol' does not fit the text to be parsed.

Why is it not falling back to the '1 -> 1' option, which absolutely should pass, when I've got the 'howdydoody' sabotage in place? Shouldn't Tree-sitter backtrack up the "so far successful" list and try the alternative parse? Instead it seems to attempt the sabotaged 'howdydoody' branch but never tries the '1 -> 1' string literal.


Solution

  • I discovered the cause: if the external scanner (the C code) identifies a token and returns it, Tree-sitter will not backtrack to a point before that token.
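
One way to act on this is to have the scanner return a token only when the parser has said it expects one. The following is a minimal sketch, not the scanner from this question. It assumes the problem was a layout token being emitted regardless of parser state. The token ids, the SketchLexer struct, and the scan_layout function are hypothetical stand-ins so the snippet compiles on its own; a real scanner would include "tree_sitter/parser.h", take a TSLexer, and implement the tree_sitter_<language>_external_scanner_scan entry point, which receives the same kind of valid_symbols array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical external token ids. In a real grammar their order must
   match the `externals` array in grammar.js. */
enum TokenType { LAYOUT_START, LAYOUT_SEMICOLON, LAYOUT_END, TOKEN_COUNT };

/* Stand-in for the two TSLexer fields this sketch touches. */
typedef struct {
    int32_t lookahead;      /* current character, 0 at end of input */
    uint16_t result_symbol; /* token id reported back to the parser */
} SketchLexer;

/* Returns true only when a separator is present in the input AND the
   parser expects one here. Returning false hands tokenization back to
   the grammar's own rules, so the scanner does not commit the parser
   to a token it never asked for. */
static bool scan_layout(SketchLexer *lexer, const bool *valid_symbols) {
    assert(lexer != NULL && valid_symbols != NULL);

    bool at_separator = lexer->lookahead == '\n' || lexer->lookahead == ';';
    if (at_separator && valid_symbols[LAYOUT_SEMICOLON]) {
        lexer->result_symbol = LAYOUT_SEMICOLON;
        return true;
    }
    return false;
}
```

The same guard would apply to the indent and dedent tokens: check the matching valid_symbols entry before setting result_symbol and returning true.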