
ANTLR nested parse tree for YAML


I want to generate a nested parse tree for the sample YAML file below using ANTLR. I tried the grammar below, but for some reason the resulting tree does not reflect the nesting of the nodes in the YAML file.

The YAML file is:

kind: Test
metadata:
  name: target
  labels:
    runnable: target
  annotations:
    message_value: Hi
    id: 1
    node_id: 2
    hex_id: 3

The ANTLR grammar I tried is:

grammar Sample;
yaml: (entry NEWLINE)* EOF;
entry: (keyValue | mapping);
keyValue: SPACE* KEY COLON SPACE* value SPACE* NEWLINE;
mapping: SPACE* KEY COLON SPACE* NEWLINE (nestedEntry)+;
nestedEntry: SPACE* keyValue | mapping;
value: STRING | NUMBER | (NEWLINE SPACE* mapping);
KEY: [a-zA-Z_]+[0-9]*[a-zA-Z_]*;
STRING: [a-zA-Z._]+;
NUMBER: [0-9]+;
NEWLINE: [\r\n]+;
SPACE: [ ] -> skip;
COLON: ':' -> skip;

The expected parse tree output is:

  1. The 'yaml' root node should have two nodes, 'kind' and 'metadata'.
  2. The 'kind' node should have only one leaf node, 'Test'.
  3. The 'metadata' node should have three nodes: 'name', 'labels' and 'annotations'.
  4. The 'name' node should have one leaf node, 'target'.
  5. The 'labels' node should have a node 'runnable', which has a leaf node 'target'.
  6. The 'annotations' node should have four nodes, 'message_value', 'id', 'node_id' and 'hex_id', which have the leaf nodes 'Hi', 1, 2 and 3 respectively.

How can I get this parse tree?

Any idea what the issue with the grammar above is and how to resolve it?


Solution

  • Your KEY and STRING rules overlap too much, so STRING almost never gets matched. When two (or more) ANTLR lexer rules match the same number of characters, the one defined first "wins". So Test, target and Hi are not tokenized as STRING, but as KEY.
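
The tie-breaking behaviour described above can be imitated in plain Java. This is only a rough sketch that uses `java.util.regex` instead of the ANTLR runtime: the class `LexerTieBreak` and the method `winningRule` are hypothetical names made up for the illustration, and the patterns are copied from the question's KEY, STRING and NUMBER rules. It picks the longest match and, on a tie, the rule declared first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LexerTieBreak {
    // The question's lexer rules in declaration order
    // (LinkedHashMap preserves insertion order).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("KEY", Pattern.compile("[a-zA-Z_]+[0-9]*[a-zA-Z_]*"));
        RULES.put("STRING", Pattern.compile("[a-zA-Z._]+"));
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
    }

    // Returns the name of the rule that wins at the start of the input:
    // the longest match, or the earliest-declared rule on a tie.
    // Returns null if no rule matches.
    static String winningRule(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly greater: a later rule only wins with a longer match.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Test", "target", "Hi", "1", "a.b"}) {
            System.out.println(s + " -> " + winningRule(s));
        }
    }
}
```

With these rules, a purely alphabetic word is claimed by KEY, and STRING only wins when the text contains a dot, because only then is its match longer.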

    Also, you're skipping SPACE and COLON in your lexer, so those tokens never reach the parser, even though your parser rules reference both. COLON shouldn't be skipped in the first place.

    Try something like this instead:

    grammar Sample;

    yaml         : entry (NEWLINE+ entry)* NEWLINE* EOF;
    entry        : (keyValue | mapping);
    keyValue     : KEY_OR_VALUE COLON value;
    mapping      : KEY_OR_VALUE COLON NEWLINE nestedEntry+;
    nestedEntry  : keyValue | mapping;
    value        : KEY_OR_VALUE+ | NUMBER | NEWLINE mapping;
    
    KEY_OR_VALUE : [a-zA-Z_]+ [a-zA-Z_0-9.]*;
    NUMBER       : [0-9]+;
    COLON        : ':';
    NEWLINE      : [\r\n]+;
    SPACE        : [ \t] -> skip;
    

    which will parse your example input like this:

    [Image: parse tree for the example input]

    I am sure you're aware of it, but writing an ANTLR grammar for YAML is rather tricky because of indentation. You could have a look here: https://github.com/umaranis/FastYaml

    EDIT

    Without being able to recognize indentations, you cannot make a distinction between:

    property:
      key:
        value: 1
        value: 2
    

    and

    property:
      key:
        value: 1
      value: 2
    

    Both value: 1 and value: 2 start with indentation, but because the lexer skips spaces, the parser has no way to tell how deep each indentation is. I only showed how to turn your grammar into a valid ANTLR grammar. Your current grammar cannot easily be changed to recognize indentation. You should study the FastYaml grammar I linked to.
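
To make the difference between the two inputs visible, the indentation has to be turned into explicit tokens before (or while) parsing. The following is a minimal sketch of that idea in plain Java, not the approach taken by FastYaml: the class `IndentTokens`, the method `tokenize` and the marker names INDENT and DEDENT are assumptions made up for the illustration. It assumes space-only indentation and does not validate a dedent to a level that was never opened.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IndentTokens {
    // Turns leading spaces into explicit INDENT/DEDENT markers.
    // Each non-blank line becomes one entry (trimmed), preceded by
    // INDENT or DEDENT markers whenever its depth changes.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Deque<Integer> levels = new ArrayDeque<>();
        levels.push(0); // the top level has zero indentation
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not affect nesting
            }
            int indent = 0;
            while (indent < line.length() && line.charAt(indent) == ' ') {
                indent++;
            }
            if (indent > levels.peek()) {
                levels.push(indent);
                out.add("INDENT");
            }
            while (indent < levels.peek()) {
                levels.pop();
                out.add("DEDENT");
            }
            out.add(line.trim());
        }
        // Close every block that is still open at end of input.
        while (levels.peek() > 0) {
            levels.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        String first = "property:\n  key:\n    value: 1\n    value: 2\n";
        String second = "property:\n  key:\n    value: 1\n  value: 2\n";
        System.out.println(tokenize(first));
        System.out.println(tokenize(second));
    }
}
```

For the first input the two value lines sit between the same INDENT/DEDENT pair, while for the second input a DEDENT appears between them. A grammar could then reference such markers in a rule for nested mappings, which is what a space-skipping lexer cannot provide.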