Tags: c, unicode, utf-8, grapheme

How to check if a UTF-8 string starts with an 'a'


I have a UTF-8 string given as a null-terminated const char*. I would like to know if the first letter of this string is an a by itself. The following code

bool f(const char* s) {
  return s[0] == 'a';
}

is wrong, as the first letter (grapheme cluster) of the string might be à, made from two Unicode scalar values: a (U+0061) followed by a combining grave accent (U+0300). So this very simple question seems extremely difficult to answer unless you know how grapheme clusters are formed.

Still, many libraries parse UTF-8 files (YAML files, for instance) and therefore should be able to answer this kind of question. But these libraries don't seem to depend upon a Unicode library.

So my questions are:


Solution

  • It simply doesn't matter.

    Consider: Is this string valid JSON?

    "̀"
    

    (That's the byte sequence 22 cc 80 22.)

    You seem to be arguing that it is not, since a JSON string should start with " (QUOTATION MARK) but this one instead starts with the grapheme cluster "̀ (QUOTATION MARK + COMBINING GRAVE ACCENT).

    The only reasonable response is that you're thinking at the wrong level: Text serialization is defined in terms of code points. Grapheme clusters are only considered for processing natural language and editing text.

    And this certainly is considered valid JSON.

    >>> import json
    >>> json.loads(bytes.fromhex('22cc8022'))
    '̀'
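
To tie this back to the C function in the question: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as 'a' can never appear inside one. The test s[0] == 'a' therefore answers the code point question exactly. If the grapheme-level answer is really what you want, one rough approach is to also look at the next code point and reject it if it is a combining mark. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be valid null-terminated UTF-8, and checking only the Combining Diacritical Marks block (U+0300 to U+036F, encoded as the bytes CC 80 through CD AF) is a deliberate simplification of the real grapheme cluster rules.

```c
#include <stdbool.h>

/* Code point level: true if the first code point is U+0061.
   A single byte comparison is enough, because ASCII bytes never
   occur inside a multi-byte UTF-8 sequence. */
bool starts_with_code_point_a(const char *s) {
  return s[0] == 'a';
}

/* Hypothetical helper: true if s begins with a code point in
   U+0300..U+036F (Combining Diacritical Marks).
   ASSUMPTION: this one block stands in for "any combining mark";
   the full Unicode rules cover many more code points. */
static bool is_combining_diacritic(const char *s) {
  const unsigned char *p = (const unsigned char *)s;
  if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF; /* U+0300..U+033F */
  if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF; /* U+0340..U+036F */
  return false;
}

/* Approximate grapheme level: true if the string starts with 'a'
   and that 'a' is not followed by a combining diacritic. */
bool starts_with_lone_a(const char *s) {
  return s[0] == 'a' && !is_combining_diacritic(s + 1);
}
```

For the bytes 61 cc 80 (a followed by COMBINING GRAVE ACCENT), starts_with_code_point_a returns true while starts_with_lone_a returns false. For anything beyond this approximation you would need real grapheme break data from a Unicode library such as ICU, which is exactly the dependency that parsers working at the code point level get to avoid.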