java scala antlr antlr4

My Scala file fails to parse with the Scala.g4 ANTLR grammar


I'm using ANTLR to parse Scala files.

I found the grammar for the Scala language here: https://github.com/antlr/grammars-v4/blob/master/scala/Scala.g4

I generated the ANTLR classes from the grammar using the antlr4-maven-plugin.

<plugin>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-maven-plugin</artifactId>
    <version>4.13.1</version>
    <executions>
        <execution>
            <id>antlr-generate</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>antlr4</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <sourceDirectory>src/main/antlr4</sourceDirectory>
        <outputDirectory>target/generated-sources/antlr4</outputDirectory>
        <listener>true</listener>
        <visitor>true</visitor>
    </configuration>
</plugin>

I have a dependency on the runtime:

<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
    <version>4.13.1</version>
</dependency>

Here's my code to parse a Scala file:

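import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class Main {

    public static void main(String[] args) throws IOException {
        // Lex and parse the file given as the first command-line argument.
        Path filePath = Paths.get(args[0]);
        CharStream charStream = CharStreams.fromPath(filePath, StandardCharsets.UTF_8);
        ScalaLexer lexer = new ScalaLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ScalaParser parser = new ScalaParser(tokens);
        ParseTree tree = parser.compilationUnit();
        // Walk the resulting tree with the generated no-op listener.
        ParseTreeWalker.DEFAULT.walk(new ScalaBaseListener(), tree);
    }
}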

I can successfully use this Scala file, test.scala, as input:

object replace extends RewriteRule {
  def transform(node: Node): Seq[Node] = node match {
    case <abc:name>{ value @ _* }</abc:name> => <abc:surname>{ value }</abc:surname>
    case _ => node
  }
}

However, the same input with the override keyword added fails:

object replace extends RewriteRule {
  override def transform(node: Node): Seq[Node] = node match {
    case <abc:name>{ value @ _* }</abc:name> => <abc:surname>{ value }</abc:surname>
    case _ => node
  }
}

stderr reports unexpected characters, such as the : in <abc:name>.

line 3:13 extraneous input ':' expecting {'-', 'null', 'this', 'super', '(', '_', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral, Varid, NL}
line 3:19 extraneous input '{' expecting {'-', 'null', 'this', 'super', '(', '_', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral, Varid, NL}
line 3:27 mismatched input '@' expecting {'=>', 'if'}
line 3:30 extraneous input '*' expecting {'-', 'null', 'this', 'super', '(', '{', '}', 'type', 'val', '_', 'implicit', 'if', 'while', 'try', 'do', 'for', 'throw', 'return', '+', '~', '!', 'new', 'lazy', 'case', '@', 'var', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'import', 'def', 'class', 'object', 'trait', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral}

The Scala file is correct: it is a simplified version of the file I want to parse, which compiles.

What do I need to fix in the grammar?


Solution

  • It appears the grammar does not handle XML literals, since the following code, which contains none, is parsed successfully:

    object replace {
      def process(node: Node): Seq[Node] = node match {
        case a => 1
        case _ => node
      }
    }
    

    However, after a quick Google search, it appears XML literals are on their way out: Scala 3 still accepts them, but they are slated to be dropped in favor of XML string interpolation. So, to answer your question:

    "What do I need to fix in the grammar?"

    The answer would be: make the lexer and parser recognize XML literals. A quick fix would be to add the lexer rule:

    XmlLiteral
     : '<' ~[ \t\r\n<>]+ '>' (XmlLiteral | ~[<>])*? '</' ~[ \t\r\n<>]+ '>'
     ;
    

    and then add XmlLiteral to the literal parser rule:

    literal
        : '-'? IntegerLiteral
        | '-'? FloatingPointLiteral
        | BooleanLiteral
        | CharacterLiteral
        | StringLiteral
        | SymbolLiteral
        | 'null'
        | XmlLiteral
        ;
    

    Then your example input is properly parsed.

    I say "quick fix" because it causes the XML literal to be tokenized as a single token, without any structure. Parsing the XML itself into a proper tree would need many more changes to both the lexer and parser grammars.
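
    To get a feel for what the quick-fix rule swallows as one token, here is a rough sketch using java.util.regex rather than ANTLR. It is only an approximation: the class and method names are made up for illustration, and the recursive XmlLiteral alternative is left out because a regex cannot recurse, so nested elements are not handled. As in the lexer rule, the closing tag name is not checked against the opening one.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical helper, not part of ANTLR or the Scala grammar.
    class XmlLiteralSketch {
        // Non-recursive approximation of the lexer rule:
        //   '<' ~[ \t\r\n<>]+ '>'  ~[<>]*?  '</' ~[ \t\r\n<>]+ '>'
        static final Pattern XML_LITERAL = Pattern.compile(
                "<[^ \\t\\r\\n<>]+>[^<>]*?</[^ \\t\\r\\n<>]+>");

        // Returns every substring that the approximation treats as one XML literal.
        static List<String> findLiterals(String source) {
            List<String> found = new ArrayList<>();
            Matcher m = XML_LITERAL.matcher(source);
            while (m.find()) {
                found.add(m.group());
            }
            return found;
        }

        public static void main(String[] args) {
            String line = "case <abc:name>{ value @ _* }</abc:name> => "
                    + "<abc:surname>{ value }</abc:surname>";
            // Prints the two literals from the question, one per line.
            findLiterals(line).forEach(System.out::println);
        }
    }
    ```

    Note that the embedded Scala expression { value @ _* } ends up inside the matched text, which is exactly the loss of structure described above.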

    EDIT

    "I'm sorry, I just realized I made a mistake in my original post. I've just edited it, and it appears to be something related to the override keyword."

    I understand the confusion, but that is not the issue. If you include the EOF token in your compilationUnit rule:

    compilationUnit
        : ('package' qualId)* topStatSeq EOF
        ;
    

    and run your example again (the one you posted without the override), you will see the following errors on your console:

    line 1:35 mismatched input '{' expecting {<EOF>, 'implicit', 'lazy', 'case', '@', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'import', 'class', 'object', 'trait', 'package'}
    line 3:34 token recognition error at: '/a'
    line 3:71 token recognition error at: '/a'
    

    This is because adding EOF forces the parser to consume all tokens. Without EOF, only the first few tokens are consumed, and then the parser stops because it cannot cope with the XML literal. Try adding this:

    System.out.println(tree.toStringTree(parser));
    

    and you'll see that only this parse tree is printed:

    [image: the printed parse tree]

    With my proposed fix, the override example also works.
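
    The effect of the trailing EOF can be shown in miniature without ANTLR. The following is a toy recognizer, not generated parser code: the class, the methods, and the "rule" items : WORD* are all invented for this illustration. Without the end-of-input check, the rule reports success after matching any valid prefix, which is how the original example appeared to parse.

    ```java
    import java.util.List;

    // Hypothetical toy recognizer, unrelated to the generated ScalaParser.
    class PrefixVersusEof {
        // Toy rule  items : WORD* ;  where a WORD is a token made only of letters.
        // Returns how many tokens were consumed before the rule stopped.
        static int matchItems(List<String> tokens) {
            int i = 0;
            while (i < tokens.size()
                    && tokens.get(i).chars().allMatch(Character::isLetter)) {
                i++;
            }
            return i;
        }

        // With requireEof == false, a matched prefix is enough to succeed.
        // With requireEof == true, every token must have been consumed.
        static boolean parse(List<String> tokens, boolean requireEof) {
            int consumed = matchItems(tokens);
            return !requireEof || consumed == tokens.size();
        }

        public static void main(String[] args) {
            List<String> input = List.of("object", "replace", "<", "abc", ">");
            System.out.println(parse(input, false)); // true: stops quietly at "<"
            System.out.println(parse(input, true));  // false: "<" was never consumed
        }
    }
    ```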

    EDIT 2

    Supporting attributes (up to a point), with the XML literal still tokenized as a single token, could look like this:

    XmlLiteral
     : XmlOpenTag (XmlLiteral | ~[<>])*? XmlCloseTag
     ;
    
    fragment XmlOpenTag
     : '<' ~[ \t\r\n<>]+ (S+ Attribute)* S* '>'
     ;
    
    fragment XmlCloseTag
     : '</' ~[ \t\r\n<>]+ '>'
     ;
    
    fragment Attribute
     : AttributeKey S* '=' S* AttributeValue
     ;
    
    fragment AttributeKey
     : [a-zA-Z_0-9]+
     ;
    
    fragment AttributeValue
     : AttributeKey
     | '"' ~["]* '"'
     | '\'' ~[']* '\''
     ;
    
    fragment S
     : [ \t\r\n]
     ;
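
    To see roughly which opening tags the fragments above accept, they can be transcribed one-to-one into a java.util.regex pattern. This is a sketch under the assumption that regex matching is a close enough stand-in for the lexer here; the class and constant names are invented, and ANTLR itself is not involved.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical helper; each constant mirrors one fragment of the grammar.
    class XmlOpenTagSketch {
        static final String S = "[ \\t\\r\\n]";
        static final String KEY = "[a-zA-Z_0-9]+";
        // AttributeValue: a bare key, a double-quoted string, or a single-quoted string.
        static final String VALUE = "(?:" + KEY + "|\"[^\"]*\"|'[^']*')";
        static final String ATTRIBUTE = KEY + S + "*=" + S + "*" + VALUE;
        // XmlOpenTag: '<' name (S+ Attribute)* S* '>'
        static final Pattern OPEN_TAG = Pattern.compile(
                "<[^ \\t\\r\\n<>]+(?:" + S + "+" + ATTRIBUTE + ")*" + S + "*>");

        static boolean isOpenTag(String text) {
            return OPEN_TAG.matcher(text).matches();
        }

        public static void main(String[] args) {
            System.out.println(isOpenTag("<a href=\"x\" id=foo>")); // true
            System.out.println(isOpenTag("<a title='1 > 0'>"));     // true: '>' inside quotes
            System.out.println(isOpenTag("<a data-id=\"1\">"));     // false: '-' is not in AttributeKey
        }
    }
    ```

    The last case shows one of the limits: attribute names containing a hyphen are rejected because AttributeKey only allows letters, digits and underscores.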