Using re2c with ISO-8859-x


We have some text in ISO-8859-15 that we want to tokenize. (ISO-8859-15 is ISO-8859-1 with the Euro sign and other common accented characters added; for more details see ISO-8859-15.)

I am trying to get the parser to recognize all the characters. The text editors I'm using work natively in UTF-8, so to avoid hidden conversion problems I'm restricting all re2c code to ASCII, e.g.:

LATIN_CAPITAL_LETTER_A_WITH_GRAVE      = "\xc0" ;
LATIN_CAPITAL_LETTER_A_WITH_ACUTE      = "\xc1" ;
LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX = "\xc2" ;
LATIN_CAPITAL_LETTER_A_WITH_TILDE      = "\xc3" ;
...

Then:

UPPER    = [A-Z] | LATIN_CAPITAL_LETTER_A_WITH_GRAVE
                 | LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX
                 | LATIN_CAPITAL_LETTER_AE
                 | LATIN_CAPITAL_LETTER_C_WITH_CEDILLA
                 | ...

WORD     = UPPER LOWER* | LOWER+ ;

It compiles without problems and runs fine on ASCII input, but stalls whenever it hits one of these extended characters.

Has anyone seen this, and is there a way to fix it?

Thank you,

Yimin


Solution

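  • Yes, I've seen it. It comes from comparing signed and unsigned values for bytes ≥ 128: if YYCTYPE is plain char and char is signed on your platform, a byte such as 0xC0 is read as a negative number, so it never matches the scanner's comparisons against values in the 128-255 range.

    There are two ways to fix it: make unsigned char the scanner's character type, e.g. re2c:define:YYCTYPE = "unsigned char";, or compile with -funsigned-char (for gcc; other compilers have an equivalent flag). Use whichever interferes least with your existing code.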
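To make the signed/unsigned point concrete, here is a minimal sketch in plain C. It is not re2c output: count_signed and count_unsigned are hypothetical helpers that only imitate the kind of range comparison a generated scanner performs, once reading the input through a signed char cursor and once through an unsigned char cursor. It assumes the usual two's-complement platform, where the byte 0xC0 read as a signed char is -64.

```c
#include <stddef.h>

/* Hypothetical helper: count bytes in the range 0xC0..0xC3 while reading
 * the buffer through a SIGNED cursor. On two's-complement platforms 0xC0
 * is read as -64, so the range test below never succeeds. */
int count_signed(const void *buf, size_t n)
{
    const signed char *cursor = buf;
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        int yych = cursor[i];            /* sign-extended: 0xC0 -> -64 */
        if (yych >= 0xC0 && yych <= 0xC3)
            hits++;
    }
    return hits;
}

/* Same test, but reading through an UNSIGNED cursor, which is roughly
 * what setting YYCTYPE to "unsigned char" achieves. */
int count_unsigned(const void *buf, size_t n)
{
    const unsigned char *cursor = buf;
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        int yych = cursor[i];            /* zero-extended: 0xC0 -> 192 */
        if (yych >= 0xC0 && yych <= 0xC3)
            hits++;
    }
    return hits;
}
```

Feeding both helpers the four bytes "A\xc0\xc1z" shows the difference: the signed version finds no matches, while the unsigned version finds the two accented capitals. That mismatch is why the scanner gets stuck on input it otherwise handles fine in the ASCII range.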