How to exclude specific characters from a `\p{..}` unicode set in an Antlr4 Lexer?


In the language I'm writing a grammar for, I want to allow custom operator definitions whose identifiers consist of characters from the Unicode categories S, P and Me (symbols, punctuation, enclosing marks), excluding the bracketing characters and the comma below:

// Punctuation

L_PAREN   : '(';
R_PAREN   : ')';
L_CURLY   : '{';
R_CURLY   : '}';
L_BRACKET : '[';
R_BRACKET : ']';
COMMA     : ',';

// Operator Identifiers

NO_OP : (L_PAREN | R_PAREN | L_BRACKET | R_BRACKET | L_CURLY | R_CURLY | COMMA);

IDENT_OP_LIKE : (UNICODE_OPISH ~NO_OP)+;

fragment UNICODE_OPISH : [\p{S}\p{P}\p{Me}];

However, the ~NO_OP part of IDENT_OP_LIKE fails with the error "rule reference NO_OP is not currently supported in a set". I suspect I'm simply mistaken about how to express this exclusion from a set.

Is there a better, supported way to express this in an ANTLR4 lexer?

(I also tried fragment UNICODE_OPISH : [\p{S}\p{P}\p{Me}~{},[\]()]; but that complains about "chars used multiple times", which is understandable since the \p{..} escapes already cover those characters.)


Solution

  • ~NO_OP is simply not supported by ANTLR: in a lexer rule you cannot negate a reference to another rule. You can negate a character set instead: ~[(){}\[\],] matches any single character other than (, ), {, }, [, ] and ,.

    However, when doing:

    IDENT_OP_LIKE : (UNICODE_OPISH ~[(){}\[\],])+;
    

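    nothing is subtracted from UNICODE_OPISH. The two elements are matched one after the other, so each iteration consumes one UNICODE_OPISH character (which can still be a bracket or a comma) followed by one character matched by ~[(){}\[\],]. Besides, ~[(){}\[\],] also matches digits and letters.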

    If you want IDENT_OP_LIKE to match any character from the sets \p{S}, \p{P} and \p{Me} except (, ), {, }, [, ] and ,, you must remove \p{P} from the Unicode set and list the punctuation characters you do want in a separate set:

    IDENT_OP_LIKE
     : (UNICODE_OPISH | P_WITHOUT_SOME)+
     ;
    
    fragment P_WITHOUT_SOME
     : [!"#%&'*\-./:;?@\\_] // ASCII \p{P} without `(`, `)`, `{`, `}`, `[`, `]` and `,`; manually add any non-ASCII punctuation you need
     ;
    
    fragment UNICODE_OPISH
     : [\p{S}\p{Me}]
     ;