We have some text in ISO-8859-15 that we want to tokenize. (ISO-8859-15 is ISO-8859-1 with the Euro sign and some other common accented characters; for more details see ISO-8859-15.)
I am trying to get the lexer to recognize all the characters. The native character encoding of the text editors I'm using is UTF-8, so to avoid hidden conversion problems, I'm restricting all re2c code to ASCII, e.g.:
LATIN_CAPITAL_LETTER_A_WITH_GRAVE = "\xc0" ;
LATIN_CAPITAL_LETTER_A_WITH_ACUTE = "\xc1" ;
LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX = "\xc2" ;
LATIN_CAPITAL_LETTER_A_WITH_TILDE = "\xc3" ;
...
Then:
UPPER = [A-Z] | LATIN_CAPITAL_LETTER_A_WITH_GRAVE
| LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX
| LATIN_CAPITAL_LETTER_AE
| LATIN_CAPITAL_LETTER_C_WITH_CEDILLA
| ...
WORD = UPPER LOWER* | LOWER+ ;
It compiles with no problems and runs great on ASCII, but it stalls whenever it hits one of these extended characters.
Has anyone seen this, and is there a way to fix it?
Thank you,
Yimin
Yes, I've seen it. It has to do with comparisons of signed vs. unsigned types for bytes ≥ 128: if the scanner reads bytes through a plain char (which is signed on most platforms), a byte like 0xC0 comes out as -64, so the comparisons in the generated code never match it.
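You can see the effect in isolation with a few lines of plain C (a standalone demo, not re2c output):

#include <stdio.h>

int main(void)
{
    /* 0xC0 is LATIN CAPITAL LETTER A WITH GRAVE in ISO-8859-15. */
    char          c = '\xc0';   /* -64 where plain char is signed */
    unsigned char u = 0xc0;     /* always 192 */

    /* The generated scanner's range checks boil down to comparisons
       like these; with signed char the extended byte never matches. */
    printf("signed:   c >= 0xc0 -> %s\n", c >= 0xc0 ? "true" : "false");
    printf("unsigned: u >= 0xc0 -> %s\n", u >= 0xc0 ? "true" : "false");
    return 0;
}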
There are two ways to fix it: make unsigned char your default type, e.g. re2c:define:YYCTYPE = "unsigned char"; , or pass -funsigned-char as a compile flag (if you're using gcc; other compilers have an equivalent). Use whichever one interferes with your existing code the least.
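For the first option, here's a minimal sketch of how it fits together (the function name lex, the one-letter LOWER definition, and the NUL sentinel are just placeholders for brevity, not your actual spec; run the file through re2c before compiling):

/* build: re2c lexer.re -o lexer.c && cc lexer.c */

/* Sketch only: LOWER stands in for your full list of lowercase letters,
   and the input is assumed to end with a NUL byte caught by the * rule. */
static int lex(const unsigned char *YYCURSOR)
{
    const unsigned char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = "unsigned char";
        re2c:yyfill:enable = 0;

        LATIN_CAPITAL_LETTER_A_WITH_GRAVE = "\xc0" ;
        UPPER = [A-Z] | LATIN_CAPITAL_LETTER_A_WITH_GRAVE ;
        LOWER = [a-z] | "\xe0" ;
        WORD = UPPER LOWER* | LOWER+ ;

        WORD { return 1; }
        *    { return 0; }
    */
}

int main(void)
{
    /* "À" then "à" in ISO-8859-15, NUL-terminated for the sentinel. */
    const unsigned char word[] = { 0xc0, 0xe0, 0x00 };
    return lex(word) ? 0 : 1;
}

The -funsigned-char route needs no source changes, but it flips the signedness of plain char for the whole translation unit, so watch out for any code that relies on char being signed.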