Tags: antlr, antlr4, flex-lexer

How to achieve lex/flex-like start states with Antlr4 (or what are the proper semantics with Antlr4)


Distilled down to a very simple example, I have an input file with "name equals value" pairs. The name is restricted to certain characters; the value can contain anything up to the newline.

So the regular expression that matches a line would be something like this: [a-zA-Z0-9_]+=[^\r\n]+

Here's the Antlr4 grammar, which is not correct:

grammar example;

example_file
    : code* EOF
    ;

code
    : NAME '=' VALUE '\r'? '\n'
    | NAME '=' NAME '\r'? '\n'
    ;

NAME
    : [a-zA-Z0-9_]+
    ;

VALUE
    : ~[\r\n]+
    ;

Example input:

name1=value1
name2=[value2 with extra~ chars]

The online test ground (http://lab.antlr.org/) reports: 1:0 mismatched input 'name1=value1' expecting {<EOF>, NAME}

I believe the problem is that VALUE matches the entire string and is returned as one token by the lexer.
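To illustrate that suspicion, here is a rough Python model of longest-match lexing (not ANTLR itself). It assumes the lexer tries every rule at the current position, keeps the longest match, and breaks ties by rule order, with the implicit literal tokens from the parser rules placed ahead of NAME and VALUE. The names RULES, CR, LF and tokenize are made up for this sketch.

```python
import re

# Rules in assumed priority order. "'='", "CR" and "LF" stand in for the
# implicit tokens ANTLR creates for the literals '=', '\r' and '\n'
# (CR and LF are hypothetical names, used only in this sketch).
RULES = [
    ("'='", re.compile(r"=")),
    ("CR", re.compile(r"\r")),
    ("LF", re.compile(r"\n")),
    ("NAME", re.compile(r"[a-zA-Z0-9_]+")),
    ("VALUE", re.compile(r"[^\r\n]+")),
]

def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens

print(tokenize("name1=value1\n"))
# → [('VALUE', 'name1=value1'), ('LF', '\n')]
```

At offset 0 NAME can only cover 'name1', while VALUE covers the whole line, so under these assumptions the parser never sees a NAME token at all. That would be consistent with the error message above.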

In (f)lex I would probably handle this by having a start state (e.g. %x VALUE), and so the lexer would keep the VALUE token exclusive to after the name and '=' have been recognized.

I've done quite a bit of Googling, but it's not clear to me how to handle this with Antlr4. (note again that this is a very distilled down example to focus on the main issue, which should be trivial - I would have written the code by hand if this is all that was needed ;-))

I've re-written the grammar several times, but it is becoming clear that I'm lacking some knowledge about Antlr. I did purchase the book.

Note that this question is similar, but its comments do not answer my question: "I've Problems with ANTLR4 to parse key-value-pairs"


Solution

  • One way to get something close to what you want is...

    grammar example;
    
    example_file
        : code* EOF
        ;
    
    code
        : NAME EQ (NAME | VALUE)+
        ;
    
    NEWLINE
        : [\r\n]+
        ->channel(HIDDEN)
        ;
    
    EQ
        : '='
        ;
    
    NAME
        : [a-zA-Z0-9_]+
        ;
    
    // Non-greedy: VALUE matches as little as possible, so EQ and NAME can win.
    VALUE
        : ~[\r\n]+?
        ;
    
    

    ...but that makes the code rule a bit messy. One problem is that VALUE's character set also includes the equals sign, so VALUE needs the non-greedy '?' modifier; a greedy VALUE would swallow the '=' and the rest of the line before EQ gets a chance to match.
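To see why the code rule ends up as (NAME | VALUE)+, here is a small Python model of the token stream this grammar would likely produce. It is only a sketch, not ANTLR: it assumes the non-greedy ~[\r\n]+? stops after a single character, that the longest match wins, and that ties go to the rule listed first. RULES and visible_tokens are hypothetical names.

```python
import re

# Same rule order as the combined grammar above.
RULES = [
    ("NEWLINE", re.compile(r"[\r\n]+")),
    ("EQ", re.compile(r"=")),
    ("NAME", re.compile(r"[a-zA-Z0-9_]+")),
    ("VALUE", re.compile(r"[^\r\n]")),  # assumption: +? matches one char only
]

def visible_tokens(text):
    """Return the tokens the parser would see (NEWLINE is on a hidden channel)."""
    pos, out = 0, []
    while pos < len(text):
        name, match = None, None
        for rule_name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best, so ties
            # stay with the earlier rule.
            if m and (match is None or m.end() > match.end()):
                name, match = rule_name, m
        if match is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if name != "NEWLINE":
            out.append((name, match.group()))
        pos = match.end()
    return out

print(visible_tokens("a=[b c]\n"))
# → [('NAME', 'a'), ('EQ', '='), ('VALUE', '['), ('NAME', 'b'),
#    ('VALUE', ' '), ('NAME', 'c'), ('VALUE', ']')]
```

In this model the value arrives as a mix of NAME tokens and one-character VALUE tokens, and an '=' inside a value would come out as EQ, which the code rule does not accept after the first one.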

    Another option would be to split your grammar into a lexer grammar and a parser grammar (two separate files, I called them Example1Lexer.g4 and Example1Parser.g4) and use a lexer mode...

    // Example1Lexer.g4
    lexer grammar Example1Lexer;
    
    NEWLINE
        : [\r\n]+
        ->channel(HIDDEN)
        ;
    
    // After '=', switch to VALUE_MODE so the rest of the line lexes as VALUE.
    EQ
        : '='
        ->pushMode(VALUE_MODE)
        ;
    
    NAME
        : [a-zA-Z0-9_]+
        ;
    
    mode VALUE_MODE;
    
    // The newline ends the value: hide it and return to the default mode.
    VALUE_MODE_NEWLINE
        : NEWLINE
        ->channel(HIDDEN),popMode
        ;
    
    VALUE
        : ~[\r\n]+
        ;
    
    // Example1Parser.g4 (separate file)
    parser grammar Example1Parser;
    
    options {tokenVocab=Example1Lexer;}
    
    example_file
        : code* EOF
        ;
    
    code
        : NAME EQ VALUE
        ;
    
    

    ...which might be a bit cleaner, or not, depending on your personal preferences.
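For readers coming from flex, the mode mechanism can be pictured as a stack of rule sets, much like start conditions. The Python sketch below models the lexer grammar above under that assumption; it is not the ANTLR runtime, and MODES, HIDDEN and tokenize are names invented for the illustration.

```python
import re

# One rule list per mode: (token name, pattern, action).
# An action is None, ("push", mode_name) or ("pop",).
MODES = {
    "DEFAULT": [
        ("NEWLINE", re.compile(r"[\r\n]+"), None),
        ("EQ", re.compile(r"="), ("push", "VALUE_MODE")),
        ("NAME", re.compile(r"[a-zA-Z0-9_]+"), None),
    ],
    "VALUE_MODE": [
        ("VALUE_MODE_NEWLINE", re.compile(r"[\r\n]+"), ("pop",)),
        ("VALUE", re.compile(r"[^\r\n]+"), None),
    ],
}
HIDDEN = {"NEWLINE", "VALUE_MODE_NEWLINE"}  # tokens on the hidden channel

def tokenize(text):
    """Longest-match lexing where only the rules of the current mode apply."""
    stack = ["DEFAULT"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in {stack[-1]}")
        name, m, action = best
        if action == ("pop",):
            stack.pop()
        elif action is not None:
            stack.append(action[1])
        if name not in HIDDEN:
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens

print(tokenize("name1=value1\nname2=[value2 with extra~ chars]\n"))
# → [('NAME', 'name1'), ('EQ', '='), ('VALUE', 'value1'),
#    ('NAME', 'name2'), ('EQ', '='), ('VALUE', '[value2 with extra~ chars]')]
```

Because VALUE only exists in VALUE_MODE, it cannot compete with NAME at the start of a line, and in this model a value such as b=c stays a single VALUE token.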