Distilled down to a very simple example: I have an input file with "name equals value" pairs. The name is restricted to certain characters; the value can contain anything up to the newline.
So the regular expression that matches a line would be something like this:
[a-zA-Z0-9_]+=[^\r\n]+
Here's my ANTLR4 grammar, which does not work:
grammar example;

example_file
    : code* EOF
    ;

code
    : NAME '=' VALUE '\r'? '\n'
    | NAME '=' NAME '\r'? '\n'
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

VALUE
    : ~[\r\n]+
    ;
Example input:
name1=value1
name2=[value2 with extra~ chars]
The online test ground (http://lab.antlr.org/) reports: 1:0 mismatched input 'name1=value1' expecting {<EOF>, NAME}
I believe the problem is that VALUE matches the entire string and is returned as one token by the lexer.
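To illustrate that suspicion, here is a minimal Python sketch of longest-match lexing. It is not ANTLR's implementation, only an emulation of the rule its lexer follows: the longest match wins, and ties go to the rule listed first. The `tokenize` helper and the `T_EQ`, `T_CR` and `T_NL` names are made up for illustration; they stand in for the implicit tokens ANTLR creates for the literals in the `code` rule.

```python
import re

# Token rules in priority order. NAME and VALUE mirror the grammar above;
# T_EQ, T_CR and T_NL are hypothetical names for the implicit literal tokens.
RULES = [
    ("T_EQ", re.compile(r"=")),
    ("T_CR", re.compile(r"\r")),
    ("T_NL", re.compile(r"\n")),
    ("NAME", re.compile(r"[a-zA-Z0-9_]+")),
    ("VALUE", re.compile(r"[^\r\n]+")),
]


def tokenize(text):
    """Emulate longest-match lexing; ties go to the earlier rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("name1=value1\n"))
# [('VALUE', 'name1=value1'), ('T_NL', '\n')]
```

At offset 0, NAME can match only the 5 characters of `name1`, while VALUE can match all 12 characters of the line, so VALUE wins and the parser never sees a NAME or '=' token.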
In (f)lex I would probably handle this by having a start state (e.g. %x VALUE), and so the lexer would keep the VALUE token exclusive to after the name and '=' have been recognized.
I've done quite a bit of Googling, but it's not clear to me how to handle this with ANTLR4. (Note again that this is a very distilled-down example to focus on the main issue, which should be trivial. I would have written the code by hand if this were all that was needed ;-))
I've re-written the grammar several times, but it is becoming clear that I'm lacking some knowledge about Antlr. I did purchase the book.
Note that this question is similar, but the comments do not answer my question: I've Problems with ANTLR4 to parse key-value-pairs
One way to get something close to what you want is...
grammar example;

example_file
    : code* EOF
    ;

code
    : NAME EQ (NAME | VALUE)+
    ;

NEWLINE
    : [\r\n]+
      ->channel(HIDDEN)
    ;

EQ
    : '='
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

VALUE
    : ~[\r\n]+?
    ;
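...but that makes the code rule a bit messy. One of the problems is that the VALUE pattern can also match the equals sign, so we have to use the non-greedy '?' modifier on the VALUE rule; otherwise VALUE would swallow the whole line and the EQ rule would never get a chance to match. With the modifier, VALUE matches only one character at a time, which is why the code rule has to accept a run of NAME and VALUE tokens.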
Another option would be to split your grammar into a lexer grammar and a parser grammar (two separate files; I called them Example1Lexer.g4 and Example1Parser.g4) and use a lexer mode. Modes are only allowed in lexer grammars, not in combined grammars, hence the split...
lexer grammar Example1Lexer;

NEWLINE
    : [\r\n]+
      ->channel(HIDDEN)
    ;

EQ
    : '='
      ->pushMode(VALUE_MODE)
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

mode VALUE_MODE;

VALUE_MODE_NEWLINE
    : NEWLINE
      ->channel(HIDDEN),popMode
    ;

VALUE
    : ~[\r\n]+
    ;
parser grammar Example1Parser;

options {tokenVocab=Example1Lexer;}

example_file
    : code* EOF
    ;

code
    : NAME EQ VALUE
    ;
...which might be a bit cleaner, or not, depending on your personal preferences.
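To make the mode mechanics concrete, here is a rough Python emulation of what the lexer grammar above does. It is a sketch, not ANTLR's runtime: the `MODES` table and the `tokenize` function are made up for illustration, and only the token names, patterns and mode name are taken from Example1Lexer. The idea it shows is that '=' pushes VALUE_MODE, where only VALUE and the newline rule exist, and the newline pops back to the default mode.

```python
import re

# One rule list per mode: (token name, pattern, hidden?, mode to push, pop?).
# "DEFAULT" is a stand-in name for ANTLR's default mode in this sketch.
MODES = {
    "DEFAULT": [
        ("NEWLINE", re.compile(r"[\r\n]+"), True, None, False),
        ("EQ", re.compile(r"="), False, "VALUE_MODE", False),
        ("NAME", re.compile(r"[a-zA-Z0-9_]+"), False, None, False),
    ],
    "VALUE_MODE": [
        ("VALUE_MODE_NEWLINE", re.compile(r"[\r\n]+"), True, None, True),
        ("VALUE", re.compile(r"[^\r\n]+"), False, None, False),
    ],
}


def tokenize(text):
    """Return the visible tokens, honoring the push/pop mode actions."""
    stack = ["DEFAULT"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        # Only the rules of the mode on top of the stack are candidates.
        for rule in MODES[stack[-1]]:
            m = rule[1].match(text, pos)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (rule, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        (name, _, hidden, push, pop), lexeme = best
        if not hidden:
            tokens.append((name, lexeme))  # hidden tokens never reach the parser
        if push is not None:
            stack.append(push)
        if pop:
            stack.pop()
        pos += len(lexeme)
    return tokens


sample = "name1=value1\nname2=[value2 with extra~ chars]\n"
for token in tokenize(sample):
    print(token)
```

Because NAME and EQ are not defined inside VALUE_MODE, a value such as `b=c` comes out as a single VALUE token, and the parser only ever sees the NAME EQ VALUE sequence that the code rule expects.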