parsing, compiler-construction, syntax-highlighting, treesitter

Using tree-sitter as a compiler's main parser


Can a parser generated by tree-sitter be used both for syntax highlighting and for the compiler itself? If not, why not?

It would be counterproductive to write and maintain two different parsers.

Note: I haven't used tree-sitter yet, but I am considering it for syntax highlighting of my own programming language. Because of that, I may misunderstand how its parser actually works.


Solution

  • Quoting the answer from https://github.com/tree-sitter/tree-sitter/discussions/831:

    I think the biggest downside to using a Tree-sitter parser in a compiler front-end is that, while we've done a lot of work on Tree-sitter's error recovery, we haven't yet built out functionality for error messages. So it isn't trivial to find out the exact token/position where the error initiated, and get a list of expected tokens, and things like that.

    Also, the error recovery currently isn't customizable in domain-specific ways (e.g. as soon as the word "function" appears, assume that the user meant to write an entire function definition).

    Down the road, I would love to invest in both of these things, but because there's so much other stuff we're working on, it may be a while before this happens.

    I managed to use a tree-sitter parser for a toy language to implement an interpreter in Rust: https://github.com/sgraf812/tree-sitter-lambda/blob/35fe05520e806548dedb48e7f97118847b531b26/src/main.rs

    Having done that, I can't recommend it:

    1. (Rust is a bit of a horrible language to do this in, with all the cyclic references. There might be better ways, though.)
    2. There is no AST, and no means to generate one because tree-sitter does not allow specification of reduction actions (because that again would tie the meta language to the specification language, as is the case for bison and C). This means you have to switch over Node::kind, a string. Inefficient and incomplete matches everywhere.
    3. The syntax tree nodes only store ranges, not the associated source code string, which leads to a somewhat unwieldy API; see the uses of utf8_text.

    I have a feeling that tree-sitter is best in class only when you don't need a typed overlay of the syntax tree.

    See also https://github.com/tree-sitter/tree-sitter/discussions/831#discussioncomment-5797368 for another experience report.
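
To make points 2 and 3 above concrete, here is a minimal Rust sketch of what a "typed overlay" involves. It does not use the tree-sitter crate: `RawNode` is a hypothetical stand-in that only imitates the shape of a tree-sitter node (a string kind, a byte range, children), and the node kinds `var`, `lam` and `app` are assumptions for a toy lambda calculus, not the grammar from the linked repository. The lowering step has to dispatch on strings and pass the source text along everywhere, and the compiler cannot check that every node kind is covered.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a tree-sitter node (NOT the real API):
/// a string kind, a byte range into the source, and child nodes.
#[derive(Debug)]
struct RawNode {
    kind: &'static str,
    range: Range<usize>,
    children: Vec<RawNode>,
}

impl RawNode {
    /// The node stores only a range, so the caller must hand the source
    /// back in to get the text (the same idea as `Node::utf8_text`).
    fn text<'s>(&self, source: &'s str) -> &'s str {
        &source[self.range.clone()]
    }
}

/// The typed AST a compiler would rather work with.
#[derive(Debug, PartialEq)]
enum Expr {
    Var(String),
    Lam(String, Box<Expr>),
    App(Box<Expr>, Box<Expr>),
}

/// Lower the untyped tree into the typed AST by matching on kind strings.
/// The catch-all arm is required: nothing ties these strings to the grammar,
/// so a renamed or forgotten node kind only shows up at run time.
fn lower(node: &RawNode, source: &str) -> Result<Expr, String> {
    match (node.kind, node.children.as_slice()) {
        ("var", []) => Ok(Expr::Var(node.text(source).to_string())),
        ("lam", [param, body]) => Ok(Expr::Lam(
            param.text(source).to_string(),
            Box::new(lower(body, source)?),
        )),
        ("app", [func, arg]) => Ok(Expr::App(
            Box::new(lower(func, source)?),
            Box::new(lower(arg, source)?),
        )),
        (kind, children) => Err(format!(
            "unexpected node `{}` with {} children at bytes {}..{}",
            kind,
            children.len(),
            node.range.start,
            node.range.end
        )),
    }
}

/// Build a leaf node (no children).
fn leaf(kind: &'static str, range: Range<usize>) -> RawNode {
    RawNode { kind, range, children: vec![] }
}

/// Hand-built tree for the source `\x.f x`, standing in for parser output.
fn sample() -> (&'static str, RawNode) {
    let source = "\\x.f x";
    let app = RawNode {
        kind: "app",
        range: 3..6,
        children: vec![leaf("var", 3..4), leaf("var", 5..6)],
    };
    let lam = RawNode {
        kind: "lam",
        range: 0..6,
        children: vec![leaf("var", 1..2), app],
    };
    (source, lam)
}
```

With `let (source, tree) = sample();`, calling `lower(&tree, source)` produces the typed value `Lam("x", App(Var("f"), Var("x")))`, while a node of an unhandled kind such as `leaf("comment", 0..1)` falls through to the error arm. With a real tree-sitter tree the same pattern applies, except that the kinds come from `Node::kind` and the text from `Node::utf8_text`.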