I'm using ANTLR to parse Scala files.
I found the grammar for the Scala language here: https://github.com/antlr/grammars-v4/blob/master/scala/Scala.g4
I generated the ANTLR classes from the grammar with the antlr4-maven-plugin:
<plugin>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-maven-plugin</artifactId>
    <version>4.13.1</version>
    <executions>
        <execution>
            <id>antlr-generate</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>antlr4</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <sourceDirectory>src/main/antlr4</sourceDirectory>
        <outputDirectory>target/generated-sources/antlr4</outputDirectory>
        <listener>true</listener>
        <visitor>true</visitor>
    </configuration>
</plugin>
I have a dependency on the runtime:
<dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-runtime</artifactId>
    <version>4.13.1</version>
</dependency>
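Here's my code to parse a Scala file:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class Main {
    public static void main(String[] args) throws IOException {
        Path filePath = Paths.get(args[0]);
        CharStream charStream = CharStreams.fromPath(filePath, StandardCharsets.UTF_8);
        ScalaLexer lexer = new ScalaLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ScalaParser parser = new ScalaParser(tokens);
        ParseTree tree = parser.compilationUnit();
        ParseTreeWalker.DEFAULT.walk(new ScalaBaseListener(), tree);
    }
}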
This Scala file, test.scala, parses successfully as input:
object replace extends RewriteRule {
  def transform(node: Node): Seq[Node] = node match {
    case <abc:name>{ value @ _* }</abc:name> => <abc:surname>{ value }</abc:surname>
    case _ => node
  }
}
However, this input, with the override keyword added, fails:
object replace extends RewriteRule {
  override def transform(node: Node): Seq[Node] = node match {
    case <abc:name>{ value @ _* }</abc:name> => <abc:surname>{ value }</abc:surname>
    case _ => node
  }
}
The stderr output reports unexpected characters, such as the : from <abc:name>.
line 3:13 extraneous input ':' expecting {'-', 'null', 'this', 'super', '(', '_', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral, Varid, NL}
line 3:19 extraneous input '{' expecting {'-', 'null', 'this', 'super', '(', '_', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral, Varid, NL}
line 3:27 mismatched input '@' expecting {'=>', 'if'}
line 3:30 extraneous input '*' expecting {'-', 'null', 'this', 'super', '(', '{', '}', 'type', 'val', '_', 'implicit', 'if', 'while', 'try', 'do', 'for', 'throw', 'return', '+', '~', '!', 'new', 'lazy', 'case', '@', 'var', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'import', 'def', 'class', 'object', 'trait', Id, BooleanLiteral, CharacterLiteral, SymbolLiteral, IntegerLiteral, StringLiteral, FloatingPointLiteral}
The Scala file is correct: it is a simplified version of the file I want to parse, which compiles.
What do I need to fix in the grammar?
It appears the grammar does not handle XML literals, since the following code is successfully parsed:
object replace {
  def process(node: Node): Seq[Node] = node match {
    case a => 1
    case _ => node
  }
}
However, after a quick Google search, it appears XML literals are being phased out of the language (Scala 3 still accepts them but plans to drop them) in favor of XML string interpolation. So, to answer your question:
What do I need to fix in the grammar?
The answer would be: make the lexer and parser recognize XML literals. A quick fix would be to add the lexer rule:
XmlLiteral
    : '<' ~[ \t\r\n<>]+ '>' (XmlLiteral | ~[<>])*? '</' ~[ \t\r\n<>]+ '>'
    ;
and then add XmlLiteral to the literal parser rule:
literal
    : '-'? IntegerLiteral
    | '-'? FloatingPointLiteral
    | BooleanLiteral
    | CharacterLiteral
    | StringLiteral
    | SymbolLiteral
    | 'null'
    | XmlLiteral
    ;
Then your example input is properly parsed.
I say "quick fix" because this causes the XML literal to be tokenized as a single token, without any structure. Having the XML itself parsed into a proper tree would need many more changes to both the lexer and parser grammars.
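To make the "single token" point concrete, here is a minimal Java sketch that approximates by hand what the quick-fix XmlLiteral rule accepts. It is not generated by ANTLR and does not use the ANTLR runtime; the class XmlLiteralSketch and its methods are hypothetical names made up for illustration, and checking for the closing tag first is a simplification standing in for the non-greedy loop.

```java
// Hypothetical helper, not part of ANTLR or the Scala grammar: a hand-written
// approximation of the quick-fix XmlLiteral lexer rule.
final class XmlLiteralSketch {

    // Returns the index just past the XML literal starting at `start`,
    // or -1 if no literal of the quick-fix shape starts there.
    static int matchEnd(String src, int start) {
        int i = start;
        if (i >= src.length() || src.charAt(i) != '<') return -1;
        i = skipName(src, i + 1);                 // '<' ~[ \t\r\n<>]+
        if (i < 0 || i >= src.length() || src.charAt(i) != '>') return -1;
        i++;                                      // '>'
        while (i < src.length()) {
            if (src.startsWith("</", i)) {        // '</' ~[ \t\r\n<>]+ '>'
                int j = skipName(src, i + 2);
                boolean closed = j >= 0 && j < src.length() && src.charAt(j) == '>';
                return closed ? j + 1 : -1;
            }
            char c = src.charAt(i);
            if (c == '<') {                       // nested XmlLiteral
                i = matchEnd(src, i);
                if (i < 0) return -1;
            } else if (c == '>') {
                return -1;                        // a stray '>' is excluded by ~[<>]
            } else {
                i++;                              // any other character: ~[<>]
            }
        }
        return -1;                                // input ended before the closing tag
    }

    // Consumes one or more characters that are not whitespace, '<' or '>'.
    // Returns the index after the name, or -1 if the name is empty.
    private static int skipName(String src, int from) {
        int i = from;
        while (i < src.length() && " \t\r\n<>".indexOf(src.charAt(i)) < 0) i++;
        return i > from ? i : -1;
    }

    public static void main(String[] args) {
        String line = "<abc:name>{ value @ _* }</abc:name> => rest";
        int end = matchEnd(line, 0);
        // The whole literal, including the embedded Scala block, is one chunk.
        System.out.println(line.substring(0, end));
    }
}
```

Note that the embedded { value @ _* } block is swallowed as plain characters, which is exactly why the parser never sees the Scala pattern inside the XML.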
I'm sorry, I just realized I made a mistake in my original post. I've just edited it, and the problem appears to be related to the override keyword.
I understand the confusion, but that is not the issue. If you include the EOF token in your compilationUnit rule:
compilationUnit
    : ('package' qualId)* topStatSeq EOF
    ;
and run your example again (the one you posted without the override), you will see the following errors on your console:
line 1:35 mismatched input '{' expecting {<EOF>, 'implicit', 'lazy', 'case', '@', 'override', 'abstract', 'final', 'sealed', 'private', 'protected', 'import', 'class', 'object', 'trait', 'package'}
line 3:34 token recognition error at: '/a'
line 3:71 token recognition error at: '/a'
This is because adding EOF forces the parser to consume all tokens. Without the EOF, only the first few tokens are consumed and then the parser stops, because it cannot cope with the XML literal. Try adding this:
System.out.println(tree.toStringTree(parser));
and you'll see that only this parse tree is printed:
With my proposed fix, the override example also works.
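The effect of the missing EOF can be illustrated with a toy recognizer. This is only a sketch under simplifying assumptions: it does not use ANTLR, the class EofSketch and its methods are hypothetical, and a "token" here is just a string. It shows how a rule without an end-of-input check can report success after matching only a prefix, while the same rule followed by an EOF check exposes the leftover tokens.

```java
import java.util.List;

// Toy illustration only; it does not use ANTLR and the names are hypothetical.
final class EofSketch {

    // Mimics a rule like `word*` with no EOF: consumes the longest prefix of
    // lowercase words and returns how many tokens were consumed.
    static int parseWords(List<String> tokens) {
        int consumed = 0;
        while (consumed < tokens.size() && tokens.get(consumed).matches("[a-z]+")) {
            consumed++;
        }
        return consumed;
    }

    // Mimics `word* EOF`: the same rule, but leftover tokens count as an error.
    static boolean parseWordsThenEof(List<String> tokens) {
        return parseWords(tokens) == tokens.size();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("object", "replace", "<", "abc");
        System.out.println(parseWords(tokens));        // prints 2
        System.out.println(parseWordsThenEof(tokens)); // prints false
    }
}
```

In the real setup the analogous symptom is that compilationUnit() returns without complaint while tokens remain unread in the token stream.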
Supporting attributes (up to a point), with the XML literal still being a single token, could look like this:
XmlLiteral
    : XmlOpenTag (XmlLiteral | ~[<>])*? XmlCloseTag
    ;

fragment XmlOpenTag
    : '<' ~[ \t\r\n<>]+ (S+ Attribute)* S* '>'
    ;

fragment XmlCloseTag
    : '</' ~[ \t\r\n<>]+ '>'
    ;

fragment Attribute
    : AttributeKey S* '=' S* AttributeValue
    ;

fragment AttributeKey
    : [a-zA-Z_0-9]+
    ;

fragment AttributeValue
    : AttributeKey
    | '"' ~["]* '"'
    | '\'' ~[']* '\''
    ;

fragment S
    : [ \t\r\n]
    ;