
Adding support for arrays in my ANTLR grammar


I am using ANTLR to parse a search query input like:

age > 25

or

firstname:"john"

The tree typically looks like this:

[image: parse tree]

I want to add support for an IN operator and array values, so that I can parse queries like:

firstname IN ["jane", "joe"]

I tried adding an array parser rule and referencing it from my value rule:

array
   : LBRACKET (value (COMMA value)* )? RBRACKET
   ;
   
value
   : array
   | IDENTIFIER
   | STRING
   | ENCODED_STRING
   | NUMBER
   | BOOL
  ;

The problem is that the parser recognizes the array itself as an ENCODED_STRING:

firstname IN ["john", "jane"]

[image: parse tree with array in input]

Here is my full grammar file:

grammar Query;

// syntactic rules
input
   : query EOF
   ;

query
   : left=query logicalOp=(AND | OR) right=query #opQuery
   | LPAREN query RPAREN #priorityQuery
   | criteria #atomQuery
   ;

criteria
   : key op value
   ;

key
   : IDENTIFIER
   ;

array
   : LBRACKET (value (COMMA value)* )? RBRACKET
   ;
   
value
   : array
   | IDENTIFIER
   | STRING
   | ENCODED_STRING
   | NUMBER
   | BOOL
  ;

op
   : EQ
   | GT
   | LT
   | NOT_EQ
   | IN
   | NOT_IN
   ;

// lexical rules
BOOL
    : 'true'
    | 'false'
    ;

STRING
 : '"' DoubleStringCharacter* '"'
 | '\'' SingleStringCharacter* '\''
 ;

fragment DoubleStringCharacter
   : ~["\\\r\n]
   | '\\' EscapeSequence
   | LineContinuation
   ;

fragment SingleStringCharacter
    : ~['\\\r\n]
    | '\\' EscapeSequence
    | LineContinuation
    ;

fragment EscapeSequence
    : CharacterEscapeSequence
    | HexEscapeSequence
    | UnicodeEscapeSequence
    ;

fragment CharacterEscapeSequence
 : SingleEscapeCharacter
 | NonEscapeCharacter
 ;

fragment HexEscapeSequence
 : 'x' HexDigit HexDigit
 ;
 
fragment UnicodeEscapeSequence
 : 'u' HexDigit HexDigit HexDigit HexDigit
 ;

fragment SingleEscapeCharacter
 : ['"\\bfnrtv]
 ;

fragment NonEscapeCharacter
 : ~['"\\bfnrtv0-9xu\r\n]
 ;

fragment EscapeCharacter
 : SingleEscapeCharacter
 | DecimalDigit
 | [xu]
 ;

fragment LineContinuation
 : '\\' LineTerminatorSequence
 ;

fragment LineTerminatorSequence
 : '\r\n'
 | LineTerminator
 ;

fragment DecimalDigit
 : [0-9]
 ;
fragment HexDigit
 : [0-9a-fA-F]
 ;
fragment OctalDigit
 : [0-7]
 ;

fragment POINT
   : '.'
   ;

AND
   : 'AND'
   ;

OR
   : 'OR'
   ;

NUMBER
   : ('0' .. '9') ('0' .. '9')* POINT? ('0' .. '9')*
   ;

LPAREN
   : '('
   ;


RPAREN
   : ')'
   ;

LBRACKET
   : '['
   ;

RBRACKET
    : ']'
    ;

GT
   : '>'
   ;


LT
   : '<'
   ;


EQ
   : ':'
   ;

IN
   : 'IN'
   ;

NOT_IN
    : 'NOT IN'
    ;

NOT_EQ
   : '!'
   ;

COMMA
   : ','
   ;

IDENTIFIER
   : [A-Za-z0-9.]+
   ;

ENCODED_STRING
   : ~([ :<>!()])+
   ;

LineTerminator
: [\r\n\u2028\u2029] -> channel(HIDDEN)
;

WS
   : [ \t\r\n]+ -> skip
   ;

Solution

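  • Your ENCODED_STRING rule is too greedy: it also consumes the [ and the , characters. The lexer always takes the longest match at the current position, so in your input ["john", becomes a single ENCODED_STRING token instead of an LBRACKET followed by a STRING and a COMMA, which is not what you want.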

    Do something like this instead:

    ENCODED_STRING
       : '"' ( ~[\\"] | '\\' . )* '"'
       ;
    

    With that change, your example input is parsed as an array containing two values:

    [image: parse tree of the example input with the revised rule]
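
To make the longest-match behaviour concrete, here is a minimal Python sketch that emulates how the lexer picks tokens: at each position every rule is tried, the longest match wins, and ties go to the rule declared first. This is an emulation under assumptions, not ANTLR itself. The names build_rules and tokenize are hypothetical, the regexes are hand-translated from the grammar above, the escape handling in STRING is simplified, and the fragment rules and LineTerminator are left out.

```python
import re

# Hand-translated stand-ins for the two ENCODED_STRING variants (assumption:
# these regexes approximate the ANTLR rules closely enough for this input).
ORIGINAL_ENCODED_STRING = r"[^ :<>!()]+"       # ~([ :<>!()])+
REVISED_ENCODED_STRING = r'"(?:[^\\"]|\\.)*"'  # '"' ( ~[\\"] | '\\' . )* '"'

QUERY = 'firstname IN ["john", "jane"]'


def build_rules(encoded_string):
    """Return (token name, regex) pairs in the order the grammar declares them."""
    return [
        ("BOOL", r"true|false"),
        # Simplified: any backslash escape is accepted inside a string.
        ("STRING", r'"(?:[^"\\\r\n]|\\.)*"' r"|'(?:[^'\\\r\n]|\\.)*'"),
        ("AND", r"AND"),
        ("OR", r"OR"),
        ("NUMBER", r"[0-9]+\.?[0-9]*"),
        ("LPAREN", r"\("),
        ("RPAREN", r"\)"),
        ("LBRACKET", r"\["),
        ("RBRACKET", r"\]"),
        ("GT", r">"),
        ("LT", r"<"),
        ("EQ", r":"),
        ("IN", r"IN"),
        ("NOT_IN", r"NOT IN"),
        ("NOT_EQ", r"!"),
        ("COMMA", r","),
        ("IDENTIFIER", r"[A-Za-z0-9.]+"),
        ("ENCODED_STRING", encoded_string),
        ("WS", r"[ \t\r\n]+"),
    ]


def tokenize(text, rules):
    """Longest match wins; on equal length the earlier rule wins."""
    compiled = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for label, pattern in [("original", ORIGINAL_ENCODED_STRING),
                           ("revised", REVISED_ENCODED_STRING)]:
        print(label, tokenize(QUERY, build_rules(pattern)))
```

With the original rule, the sketch yields IDENTIFIER, IN, and then two ENCODED_STRING tokens (`["john",` and `"jane"]`), so the parser never sees LBRACKET, COMMA, or RBRACKET. With the revised rule it yields IDENTIFIER, IN, LBRACKET, STRING, COMMA, STRING, RBRACKET. The quoted values come out as STRING rather than ENCODED_STRING because both rules match the same text and STRING is declared first.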