ocaml, ocamllex, ocamlyacc

Specifying ocamllex encoding


I'm currently developing a parser according to a specification, and I can't find any information about text encoding anywhere in the docs. It seems odd to me that the docs of a lexing library wouldn't mention text encoding at all, so I suspect I may simply have missed the relevant part.


Solution

  • Ocamllex works at the byte level and leaves all questions of encoding to its users.

    More precisely, working at the "byte level" means that ocamllex treats its input as a sequence of 8-bit words. The ocamllex regex engine then analyzes this sequence of 8-bit words.

    Unicode encodings can be seen as a layer of interpretation on top of this raw sequence of 8-bit words. But the ocamllex lexer is unaware of this higher layer of interpretation and just looks at the raw sequence of 8-bit words (which is not that surprising, since ocamllex and the first version of Unicode were developed around the same time, at the beginning of the nineties). In particular, the character literals in a lexer definition are interpreted using their ASCII encoding, and thus the character class

    let digit = ['0'-'9']
    

    means one byte in the interval [0x30, 0x39].

    If you want a lexer that is aware of Unicode character classes and encodings, you can look at sedlex.
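
To make the byte-level view concrete, here is a minimal sketch in plain OCaml (not ocamllex itself) of what the lexer effectively sees. The helper names `is_digit`, `bytes_of` and `count_code_points` are made up for illustration, and the code point counter assumes its input is valid UTF-8.

```ocaml
(* Byte-level view of a string, mirroring what an ocamllex-generated
   lexer sees. All names here are illustrative, not part of ocamllex. *)

(* Same test as the class ['0'-'9']: one byte in [0x30, 0x39]. *)
let is_digit (c : char) : bool =
  let b = Char.code c in
  b >= 0x30 && b <= 0x39

(* "é" (U+00E9) encoded in UTF-8 is the two bytes 0xC3 0xA9. *)
let e_acute = "\xc3\xa9"

(* List the bytes of a string as integers. *)
let bytes_of (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

(* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumption: the input is valid UTF-8. *)
let count_code_points (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if (Char.code c land 0xC0) <> 0x80 then incr n) s;
  !n

let () =
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length e_acute) (count_code_points e_acute)
(* prints "bytes: 2, code points: 1" *)
```

The same mismatch shows up in a lexer definition: a pattern that matches a single character, such as `_`, consumes one byte, so a UTF-8 encoded "é" takes two of them.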