
How to implement a case-insensitive lexer in Go using gocc?


I need to build a lexical analyzer using gocc, but the documentation doesn't mention any option to ignore case and I haven't been able to find anything related. Does anyone know how it can be done, or should I use another tool?

/* Lexical part */

_digit : '0'-'9' ;

int64 : '1'-'9' {_digit} ;

switch:  's''w''i''t''c''h';

while:  'w''h''i''l''e';

!whitespace : ' ' | '\t' | '\n' | '\r' ;

/* Syntax part */

<< 
import(
    "github.com/goccmack/gocc/example/calc/token"
    "github.com/goccmack/gocc/example/calc/util"
)
>>

Calc : Expr;

Expr :
        Expr "+" Term   << $0.(int64) + $2.(int64), nil >>
    |   Term            
;

Term :
        Term "*" Factor << $0.(int64) * $2.(int64), nil >>
    |   Factor          
;

Factor :
        "(" Expr ")"    << $1, nil >>
    |   int64           << util.IntValue($0.(*token.Token).Lit) >>
;

For example, I want "switch" to be recognized whether it is written in uppercase or lowercase, without having to type out every combination. Flex, the scanner generator usually paired with Bison, has %option caseless for this; does gocc have an equivalent?


Solution

  • Looking through the gocc docs, I don't see any option for making character literals case-insensitive, nor any way to write a character class of the kind that pretty well every regex engine and scanner generator supports. But nothing other than tedium, readability and style stops you from writing:

    switch:  ('s'|'S')('w'|'W')('i'|'I')('t'|'T')('c'|'C')('h'|'H');
    while:  ('w'|'W')('h'|'H')('i'|'I')('l'|'L')('e'|'E');
    

    That's derived from the old way of doing it in lex without a case-insensitivity option, which uses character classes to make it quite a bit more readable:

    [sS][wW][iI][tT][cC][hH]    return T_SWITCH;
    [wW][hH][iI][lL][eE]        return T_WHILE;
    

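    You can come closer to the lex version by defining 26 patterns, one per letter (the leading underscore makes each one a helper definition rather than a token, like _digit in the question):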

    _a: 'a'|'A';
    _b: 'b'|'B';
    _c: 'c'|'C';
    _d: 'd'|'D';
    _e: 'e'|'E';
    _f: 'f'|'F';
    _g: 'g'|'G';
    _h: 'h'|'H';
    _i: 'i'|'I';
    _j: 'j'|'J';
    _k: 'k'|'K';
    _l: 'l'|'L';
    _m: 'm'|'M';
    _n: 'n'|'N';
    _o: 'o'|'O';
    _p: 'p'|'P';
    _q: 'q'|'Q';
    _r: 'r'|'R';
    _s: 's'|'S';
    _t: 't'|'T';
    _u: 'u'|'U';
    _v: 'v'|'V';
    _w: 'w'|'W';
    _x: 'x'|'X';
    _y: 'y'|'Y';
    _z: 'z'|'Z';
    

    and then explode the string literals:

    switch: _s _w _i _t _c _h;
    while: _w _h _i _l _e;