In the grammar I'm writing, I want to allow custom operator definitions whose identifiers consist of characters from the Unicode categories S, P and Me (symbols, punctuation, enclosing marks), excluding the bracketing characters and the comma below:
// Punctuation
L_PAREN : '(';
R_PAREN : ')';
L_CURLY : '{';
R_CURLY : '}';
L_BRACKET : '[';
R_BRACKET : ']';
COMMA : ',';
// Operator Identifiers
NO_OP : (L_PAREN | R_PAREN | L_BRACKET | R_BRACKET | L_CURLY | R_CURLY | COMMA);
IDENT_OP_LIKE : (UNICODE_OPISH ~NO_OP)+;
fragment UNICODE_OPISH : [\p{S}\p{P}\p{Me}];
But the ~NO_OP part of IDENT_OP_LIKE fails with the error "rule reference NO_OP is not currently supported in a set". I guess I'm mistaken about how to express this exclusion from a set.
Is there a better, supported way to express this in an ANTLR4 lexer?
(I also tried fragment UNICODE_OPISH : [\p{S}\p{P}\p{Me}~{},[\]()]; but that complains about "chars used multiple times", which is understandable since those characters were already produced by the \p escapes.)
~NO_OP is simply not supported by ANTLR: you cannot negate another rule. You can write ~[(){}\[\],] to match any character other than (, ), {, }, [, ] and the comma.
However, when doing:
IDENT_OP_LIKE : (UNICODE_OPISH ~[(){}\[\],])+;
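the UNICODE_OPISH part will still match the characters negated by ~[(){}\[\],], because the two elements match two consecutive characters instead of subtracting one set from the other. Besides, ~[(){}\[\],] would also match digits and letters.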
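To make the difference between "two sets in sequence" and "one set minus another" concrete, here is a minimal sketch that uses java.util.regex purely as an analogy. It is not ANTLR syntax, and the class and field names are made up for illustration. Java's regex engine happens to support class intersection with &&, which is the operation the sequence rule above does not perform.

```java
import java.util.regex.Pattern;

public class SequenceVsSubtraction {
    // Regex analogue of (UNICODE_OPISH ~[(){}\[\],])+ :
    // each iteration consumes TWO characters, one from each set.
    static final Pattern SEQUENCE =
        Pattern.compile("(?:[\\p{S}\\p{P}\\p{Me}][^(){}\\[\\],])+");

    // The intended meaning: each iteration consumes ONE character taken from
    // the category set minus the excluded characters (&& is intersection).
    static final Pattern SUBTRACTION =
        Pattern.compile("[\\p{S}\\p{P}\\p{Me}&&[^(){}\\[\\],]]+");

    public static void main(String[] args) {
        // "(" is punctuation and "a" is not excluded, so the sequence accepts it.
        System.out.println(SEQUENCE.matcher("(a").matches());     // true
        System.out.println(SUBTRACTION.matcher("(a").matches());  // false
        // Three symbols: fine for subtraction, but the sequence needs pairs.
        System.out.println(SUBTRACTION.matcher("<=>").matches()); // true
        System.out.println(SEQUENCE.matcher("<=>").matches());    // false
    }
}
```

The sequence form accepts a parenthesis followed by a letter and rejects an odd-length run of symbols, which is the opposite of what the question asks for.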
If you want IDENT_OP_LIKE to match any character from the sets \p{S}, \p{P} and \p{Me} except the brackets (, ), {, }, [, ] and the comma, you must remove \p{P} from the set and list the punctuation characters you do want in a separate set:
IDENT_OP_LIKE
: (UNICODE_OPISH | P_WITHOUT_SOME)+
;
fragment P_WITHOUT_SOME
: [.?/;:"'_\-] // manually add all punctuation chars except `(`, `)`, `{`, `}`, `[`, `]` and `,`
;
fragment UNICODE_OPISH
: [\p{S}\p{Me}]
;