I am using ANTLR to parse a search query input like:
age > 25
or
firstname:"john"
The tree typically looks like this:
I want to add support for an IN operator and array values, so that I can parse queries like:
firstname IN ["jane", "joe"]
I tried adding a parser rule for arrays and referencing it from my value rule:
array
: LBRACKET (value (COMMA value)* )? RBRACKET
;
value
: array
| IDENTIFIER
| STRING
| ENCODED_STRING
| NUMBER
| BOOL
;
The problem is that the parser recognizes the array itself as an ENCODED_STRING for input such as:
firstname IN ["john", "jane"]
Here is my full grammar file:
grammar Query;
// syntactic rules
input
: query EOF
;
query
: left=query logicalOp=(AND | OR) right=query #opQuery
| LPAREN query RPAREN #priorityQuery
| criteria #atomQuery
;
criteria
: key op value
;
key
: IDENTIFIER
;
array
: LBRACKET (value (COMMA value)* )? RBRACKET
;
value
: array
| IDENTIFIER
| STRING
| ENCODED_STRING
| NUMBER
| BOOL
;
op
: EQ
| GT
| LT
| NOT_EQ
| IN
| NOT_IN
;
// lexical rules
BOOL
: 'true'
| 'false'
;
STRING
: '"' DoubleStringCharacter* '"'
| '\'' SingleStringCharacter* '\''
;
fragment DoubleStringCharacter
: ~["\\\r\n]
| '\\' EscapeSequence
| LineContinuation
;
fragment SingleStringCharacter
: ~['\\\r\n]
| '\\' EscapeSequence
| LineContinuation
;
fragment EscapeSequence
: CharacterEscapeSequence
| HexEscapeSequence
| UnicodeEscapeSequence
;
fragment CharacterEscapeSequence
: SingleEscapeCharacter
| NonEscapeCharacter
;
fragment HexEscapeSequence
: 'x' HexDigit HexDigit
;
fragment UnicodeEscapeSequence
: 'u' HexDigit HexDigit HexDigit HexDigit
;
fragment SingleEscapeCharacter
: ['"\\bfnrtv]
;
fragment NonEscapeCharacter
: ~['"\\bfnrtv0-9xu\r\n]
;
fragment EscapeCharacter
: SingleEscapeCharacter
| DecimalDigit
| [xu]
;
fragment LineContinuation
: '\\' LineTerminatorSequence
;
fragment LineTerminatorSequence
: '\r\n'
| LineTerminator
;
fragment DecimalDigit
: [0-9]
;
fragment HexDigit
: [0-9a-fA-F]
;
fragment OctalDigit
: [0-7]
;
fragment POINT
: '.'
;
AND
: 'AND'
;
OR
: 'OR'
;
NUMBER
: ('0' .. '9') ('0' .. '9')* POINT? ('0' .. '9')*
;
LPAREN
: '('
;
RPAREN
: ')'
;
LBRACKET
: '['
;
RBRACKET
: ']'
;
GT
: '>'
;
LT
: '<'
;
EQ
: ':'
;
IN
: 'IN'
;
NOT_IN
: 'NOT IN'
;
NOT_EQ
: '!'
;
COMMA
: ','
;
IDENTIFIER
: [A-Za-z0-9.]+
;
ENCODED_STRING
: ~([ :<>!()])+
;
LineTerminator
: [\r\n\u2028\u2029] -> channel(HIDDEN)
;
WS
: [ \t\r\n]+ -> skip
;
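Your ENCODED_STRING rule is too greedy: it also consumes the [ and the , characters. ANTLR's lexer always picks the rule that matches the most characters, so ["john", becomes a single ENCODED_STRING token instead of LBRACKET, STRING and COMMA, which is not what you want.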
Do something like this instead:
ENCODED_STRING
: '"' ( ~[\\"] | '\\' . )* '"'
;
and then your example input is parsed as:
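To check the token stream without running ANTLR, the lexer's matching policy (longest match wins, ties go to the rule defined first) can be imitated in a few lines of Python. This is only a rough sketch, not ANTLR itself: the `tokenize` helper and the `COMMON`, `ORIGINAL` and `FIXED` rule lists are hypothetical names, the regular expressions are simplified translations of a subset of the lexer rules above, and STRING is reduced to its double-quoted form.

```python
import re

def tokenize(text, rules):
    """Imitate ANTLR's lexer policy: the longest match wins,
    and a tie goes to the rule listed first."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern, re.DOTALL).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Simplified subset of the lexer rules, in the grammar's relative order.
COMMON = [
    ("STRING", r'"(?:[^"\\\r\n]|\\.)*"'),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("IN", r"IN"),
    ("COMMA", r","),
    ("IDENTIFIER", r"[A-Za-z0-9.]+"),
]
WS = ("WS", r"[ \t\r\n]+")

# ENCODED_STRING as in the question: anything except space : < > ! ( )
ORIGINAL = COMMON + [("ENCODED_STRING", r"[^ :<>!()]+"), WS]
# ENCODED_STRING as suggested in this answer: a double-quoted string
FIXED = COMMON + [("ENCODED_STRING", r'"(?:[^\\"]|\\.)*"'), WS]

query = 'firstname IN ["john", "jane"]'
print(tokenize(query, ORIGINAL))
print(tokenize(query, FIXED))
```

Under these assumptions, the original rule set yields two ENCODED_STRING tokens, `["john",` and `"jane"]`, so the array rule can never match. The revised rule set yields LBRACKET, STRING, COMMA, STRING, RBRACKET. In this simulation the quoted values come out as STRING rather than ENCODED_STRING, because both rules match the same text and STRING is defined first.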