antlr4 abnf

ANTLR 4.1: variable token multiplicity yields error "closure with at least one alternative that can match an empty string"


I'm trying to write a grammar for Internationalized Resource Identifiers (IRIs) in ANTLR 4.1. The hardest part so far has been getting the production rule for ipv6address to work correctly. RFC 3987 defines ipv6address with 9 different alternatives in ABNF for that production rule alone:

IPv6address    =                            6( h16 ":" ) ls32
              /                       "::" 5( h16 ":" ) ls32
              / [               h16 ] "::" 4( h16 ":" ) ls32
              / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
              / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
              / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
              / [ *4( h16 ":" ) h16 ] "::"              ls32
              / [ *5( h16 ":" ) h16 ] "::"              h16
              / [ *6( h16 ":" ) h16 ] "::" 

Here, ls32 and h16 are both subrules. ls32 is defined as:

ls32           = ( h16 ":" h16 ) / IPv4address

And h16 is defined as:

h16            = 1*4HEXDIG

Here, HEXDIG is a lexer rule for valid hexadecimal digits. I've tried to translate this ABNF grammar into ANTLR syntax as follows:

grammar IRI;                                    


iri     : scheme ':' ihier_part ('?' iquery)? ('#' ifragment)? ;

ihier_part  : ('//' iauthority ipath_abempty
    | ipath_absolute
    | ipath_rootless)?
    ;

iri_reference   : iri                               
    | irelative_ref                         
    ;

absolute_IRI    : scheme ':' ihier_part ('?' iquery)? ;

irelative_ref   : irelative_part ('?' iquery)? ('#' ifragment)? ;

irelative_part  : ('//' iauthority ipath_abempty
    | ipath_absolute
    | ipath_noscheme)?
    ;

iauthority      : (iuserinfo '@')? ihost (':' port)? ;

iuserinfo       : (iunreserved | pct_encoded | sub_delims | ':')* ;

ihost           : ip_literal
    | ipv4address
    | ireg_name
    ;

ireg_name       : (iunreserved | pct_encoded | sub_delims)* ;

ipath   : (ipath_abempty                        
    | ipath_absolute                        
    | ipath_noscheme                        
    | ipath_rootless)?                      
    ;

ipath_abempty   : ('/' isegment)* ;

ipath_absolute  : '/' (isegment_nz ('/' isegment)*)? ;

ipath_noscheme  : isegment_nz_nc ('/' isegment)* ;

ipath_rootless  : isegment_nz ('/' isegment)* ;


isegment    : (ipchar)* ;

isegment_nz : (ipchar)+ ;

isegment_nz_nc  : (iunreserved | pct_encoded | sub_delims | '@')+ ;     

ipchar      : iunreserved
    | pct_encoded
    | sub_delims
    | ':'
    | '@'
    ;

iquery      : (ipchar | IPRIVATE | '/' | '?')* ;

ifragment   : (ipchar | '/' | '?')* ;

iunreserved : ALPHA
    | DIGIT
    | '-'
    | '.'
    | '_'
    | '~'
    | UCSCHAR
    ;

fragment
UCSCHAR     : '\u00A0'..'\uD7FF'   | '\uF900'..'\uFDCF'   | '\uFDF0'..'\uFFEF'  
    | '\u40000'..'\u4FFFD' | '\u50000'..'\u5FFFD' | '\u60000'..'\u6FFFD'
    | '\u70000'..'\u7FFFD' | '\u80000'..'\u8FFFD' | '\u90000'..'\u9FFFD'    
    | '\uA0000'..'\uAFFFD' | '\uB0000'..'\uBFFFD' | '\uC0000'..'\uCFFFD'
    | '\uD0000'..'\uDFFFD' | '\uE1000'..'\uEFFFD'
    ;

fragment
IPRIVATE    : '\uE000'..'\uF8FF' | '\uF0000'..'\uFFFFD' | '\u100000'..'\u10FFFD' ;

scheme      : ALPHA (ALPHA | DIGIT | '+' | '-' | '.')* ;

port        : (DIGIT)* ;

ip_literal  : '[' (ipv6address | ipvFuture) ']' ;

ipvFuture   : 'v' (HEXDIG)+ '.' (unreserved | sub_delims | ':')+ ;

ipv6address
locals [int i1, i2, i3, i4, i5, i6, i7, i8, i9, i10 = 0;]               
    : ( {$i1<=6}? h16 ':' {$i1++;} )* ls32                  
    | '::' ( {$i2<=5}? h16 ':' {$i2++;} )* ls32
    | (h16)? '::' ( {$i3<=4}? h16 ':' {$i3++;} )* ls32
    | ((h16 ':')? h16)? '::' ( {$i4<=3}? h16 ':'{$i4++;} )* ls32
    | (( {$i5>=0 && $i5<=2}? h16 ':' {$i5++;} )* h16)? '::' ( {$i6<=2}? h16 ':' {$i6++;} )* ls32
    | (( {$i7>=0 && $i7<=3}? h16 ':' {$i7++;} )* h16)? '::' h16 ':' ls32
    | (( {$i8>=0 && $i8<=4}? h16 ':' {$i8++;} )* h16)? '::' ls32
    | (( {$i9>=0 && $i9<=5}? h16 ':' {$i9++;} )* h16)? '::' h16
    | (( {$i10>=0 && $i10<=6}? h16 ':' {$i10++;} )* h16)* '::'
    ;

h16
locals [int i = 1;]
    : ( {$i>=1 && $i<=4}? HEXDIG {$i++;} )* ;       

ls32        : h16 ':' h16 ;

ipv4address : DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET ;

DEC_OCTET   : '0'..'9'                      
    | '10'..'99'
    | '100'..'199'
    | '200'..'249'
    | '250'..'255'
    ;

pct_encoded : '%' HEXDIG HEXDIG ;

unreserved  : ALPHA | DIGIT | '-' | '.' | '_' | '~' ;

reserved    : gen_delims
    | sub_delims
    ;

gen_delims  : ':' | '/' | '?' | '#' | '[' | ']' | '@' ;         

sub_delims  : '!' | '$' | '&' | '\'' | '(' | ')' ;              



DIGIT  : [0-9] ;                                
HEXDIG : [0-9A-F] ;
ALPHA  : [a-zA-Z] ;
WS     : [' ' | '\t' | '\r' | '\n']+ -> skip ;

In my ANTLR grammar, I'm trying to use semantic predicates to express the multiplicity rules defined in the ABNF grammar, both for ipv6address and for h16. When I run the org.antlr.v4.Tool class, I get the following output:

warning(125): IRI.g4:68:20: implicit definition of token 'IPRIVATE' in parser
warning(125): IRI.g4:78:4: implicit definition of token 'UCSCHAR' in parser
error(153): IRI.g4:100:0: rule 'ipv6address' contains a closure with at least one alternative that can match an empty string
warning(154): IRI.g4:40:0: rule 'ipath' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string

Obviously I'd like to get rid of the warnings as well, but first I need to get rid of the error stating that 'ipv6address' contains a closure with at least one alternative that can match an empty string. I've seen similar posts on Stack Overflow about multiple-alternatives errors, but none of them dealt with closures that can match the empty string. I'm also fairly sure I'll have to define the Unicode characters in UCSCHAR beyond \uFFFF as surrogate pairs, but I'll take care of that later. For now I just need to know how to get rid of the closure problem.


Solution

  • There are quite a few things going wrong:


    0

    What 280Z28 said.


    1

    '250'..'255' does not match the strings "250" through "255": the .. operator defines a range of single characters, not of numbers. You need to match the numeric ranges digit by digit, as described in the original ABNF spec:

    ABNF

    dec-octet      = DIGIT                 ; 0-9
                   / %x31-39 DIGIT         ; 10-99
                   / "1" 2DIGIT            ; 100-199
                   / "2" %x30-34 DIGIT     ; 200-249
                   / "25" %x30-35          ; 250-255
    

    ANTLR

    dec_octet
     : digit
     | non_zero_digit digit
     | D1 digit digit
     | ...
     ;
    

    2

    You have a lot of conflicting lexer rules. Take these for example:

    HEXDIG : [0-9A-F] ;
    ALPHA  : [a-zA-Z] ;
    

    Because HEXDIG is defined before ALPHA, the lexer will always create a HEXDIG token when it sees 'A', for example. Keep in mind that the lexer does not produce tokens based on what the parser would like to receive. It tokenizes the input independently of the parser, so it will never produce an ALPHA for the uppercase letters A-F.
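    The effect of rule order can be imitated outside ANTLR. The following is only a minimal Python sketch, not ANTLR's actual implementation: it assumes the usual "longest match wins, ties go to the rule defined first" behaviour and models just the DIGIT, HEXDIG and ALPHA rules in the order they appear in the question. The names RULES and tokenize are made up for this illustration.

```python
import re

# Hypothetical stand-in for the lexer: only three rules from the question,
# listed in the same order as in the grammar.
RULES = [
    ("DIGIT", re.compile(r"[0-9]")),
    ("HEXDIG", re.compile(r"[0-9A-F]")),
    ("ALPHA", re.compile(r"[a-zA-Z]")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs.

    Assumption: the longest match wins, and when several rules match the
    same number of characters, the rule defined first wins.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (rule_name, end_position)
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so ties keep the earlier rule.
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


if __name__ == "__main__":
    # 'A'..'F' are claimed by HEXDIG; ALPHA only ever sees other letters.
    print(tokenize("AFg"))
```

    In this sketch the parser never gets a say: whatever a rule such as scheme expects, the input letter 'A' has already been labelled HEXDIG by the time the parser sees it.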


    3

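    fragment rules can only be used inside other lexer rules (or other fragment rules). You cannot use them inside parser rules. That is why the tool warns about an implicit definition of the tokens IPRIVATE and UCSCHAR in the parser.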


    4

    Not really an issue, but the predicates make your grammar hard to read. My rule of thumb is to use as few predicates as possible.

    Your rule:

    h16
    locals [int i = 1;]
        : ( {$i>=1 && $i<=4}? HEXDIG {$i++;} )* ;
    

    could be written as:

    h16
     : HEXDIG HEXDIG HEXDIG HEXDIG
     | HEXDIG HEXDIG HEXDIG
     | HEXDIG HEXDIG
     | HEXDIG
     ;
    

    or even:

    h16
     : HEXDIG (HEXDIG (HEXDIG HEXDIG?)?)?
     ;
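    To check that the nested-optional form really means "one to four hex digits", it can be compared against a bounded repetition. This is a small Python sketch using regular expressions rather than ANTLR. The names H16_NESTED, H16_BOUNDED and is_h16 are invented for the example, and HEXDIG is assumed to be the uppercase-only set [0-9A-F] from the question's lexer rule.

```python
import re

# Assumption: HEXDIG mirrors the question's lexer rule, uppercase only.
HEXDIG = "[0-9A-F]"

# Same shape as the parser rule: HEXDIG (HEXDIG (HEXDIG HEXDIG?)?)?
H16_NESTED = re.compile(f"{HEXDIG}(?:{HEXDIG}(?:{HEXDIG}{HEXDIG}?)?)?")

# Direct reading of the ABNF 1*4HEXDIG: between one and four repetitions.
H16_BOUNDED = re.compile(f"{HEXDIG}{{1,4}}")


def is_h16(text):
    """Return True if text consists of one to four HEXDIG characters."""
    return H16_NESTED.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["", "F", "0A", "1F3", "FFFF", "FFFFF"]:
        same = is_h16(sample) == (H16_BOUNDED.fullmatch(sample) is not None)
        print(repr(sample), is_h16(sample), "agrees" if same else "differs")
```

    Note that both forms reject the empty string, as 1*4HEXDIG requires. The predicate version of h16 is built on ( ... )* and therefore also accepts zero digits.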
    


    Most of these issues are easily fixed, but #2 is trickier. What you could (should?) do is let the lexer create single-character tokens and let the parser combine them into larger units. Here is an example of how the parser could match the dec-octet production from the official ABNF:

    dec_octet
     : digit                               // 0-9
     | non_zero_digit digit                // 10-99
     | D1 digit digit                      // 100-199
     | D2 (D0 | D1 | D2 | D3 | D4) digit   // 200-249
     | D2 D5 (D0 | D1 | D2 | D3 | D4 | D5) // 250-255
     ;
    
    digit
     : D0
     | non_zero_digit
     ;
    
    non_zero_digit
     : D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
     ;
    
    // lexer rules
    D0 : '0';
    D1 : '1';
    D2 : '2';
    D3 : '3';
    D4 : '4';
    D5 : '5';
    D6 : '6';
    D7 : '7';
    D8 : '8';
    D9 : '9';
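
    The five alternatives of dec_octet can be sanity-checked outside ANTLR. The following Python sketch is only an illustration under the assumption that each input character corresponds to one of the single-character tokens D0..D9. The function name is_dec_octet is hypothetical, and each branch mirrors one alternative of the parser rule above.

```python
def is_dec_octet(text):
    """Check text against the five dec-octet alternatives, digit by digit."""
    digits = "0123456789"
    non_zero = "123456789"
    # Every character must be one of the single-digit tokens D0..D9.
    if not all(ch in digits for ch in text):
        return False
    if len(text) == 1:
        return True                                   # digit: 0-9
    if len(text) == 2:
        return text[0] in non_zero                    # non_zero_digit digit: 10-99
    if len(text) == 3:
        if text[0] == "1":
            return True                               # D1 digit digit: 100-199
        if text[0] == "2" and text[1] in "01234":
            return True                               # D2 (D0..D4) digit: 200-249
        if text[:2] == "25" and text[2] in "012345":
            return True                               # D2 D5 (D0..D5): 250-255
    return False


if __name__ == "__main__":
    accepted = [n for n in range(1000) if is_dec_octet(str(n))]
    print(accepted[0], accepted[-1], len(accepted))
```

    Leading zeros such as "01" or "007" are rejected here, which matches the ABNF: a two-digit octet has to start with a non-zero digit.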
    

    I once wrote an IRI grammar for ANTLR 3. If you want, I could put it on GitHub somewhere.