Tags: unicode, utf-8, language-agnostic, ascii

Identifying ASCII characters in a UTF-8 byte stream


Reading bytes from a UTF-8 file (i.e. not processing strings), I need to unambiguously identify certain ASCII characters which are used as delimiters - much like CSV parsing.

While this seems simple at first, I'm afraid of false positives due to combining characters: upon encountering a known character (e.g. colon; U+003A), how can I be sure it's not part of a larger sequence (grapheme cluster)?

For example, 3️⃣ ("keycap three") appears to be a sequence of several code points, starting with ASCII digit three (U+0033) - so if my delimiter happened to be U+0033, I'd have to make sure any such byte is a stand-alone ASCII character rather than the start of such a sequence.

Is this kind of distinction even possible without relying on complex libraries?


JavaScript example to illustrate processing (though I'm not interested in JavaScript per se here):

const LF = "\n".charCodeAt(0);   // 10 (U+000A)
const SPACE = " ".charCodeAt(0); // 32 (U+0020)
const COLON = ":".charCodeAt(0); // 58 (U+003A)

const content = "1: 2\n3 4";
const bytes = new TextEncoder().encode(content);
console.log(bytes);
// Uint8Array(8) [49, 58, 32, 50, 10, 51, 32, 52]

const parts = Array.from(bytes).map(byte => {
    switch (byte) {
        case LF:
            return "\n";
        case SPACE:
            return " ";
        case COLON:
            return ":";
        default:
            return byte;
    }
});
console.log(parts);
// [49, ':', ' ', 50, '\n', 51, ' ', 52]

Solution

  • Unicode Standard Annex #29 (UAX #29) gives the rules for breaking between "user-perceived characters" in section 3: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

    Taken together, the rules are complicated, but the ones that apply to your specific use case are simple:

    A regular ASCII letter, digit, or punctuation character is "standalone" unless it is followed by a "continuing character".

    A "continuing character" is either a Combining Mark or one of the join control characters U+200C (zero width non-joiner) or U+200D (zero width joiner).

    "Combining Mark" is the Unicode major General Category M (subcategories Mn, Mc, and Me): https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

    In most languages there's a readily available library with a function to see if a code point is a combining mark.
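
    As one illustration, JavaScript regular expressions support Unicode property escapes, so the category check needs no external library. This is a minimal sketch, assuming an ES2018+ runtime; the helper name isContinuing is made up for this example.

```javascript
// Matches a single code point whose General Category starts with M (Mn, Mc, Me).
// Assumption: the runtime supports Unicode property escapes (ES2018+).
const MARK = /^\p{M}$/u;

// Hypothetical helper: true if the code point continues the preceding character.
function isContinuing(codePoint) {
    if (codePoint === 0x200C || codePoint === 0x200D) {
        return true; // join controls: ZWNJ, ZWJ
    }
    return MARK.test(String.fromCodePoint(codePoint));
}

console.log(isContinuing(0x0301)); // true  (combining acute accent, Mn)
console.log(isContinuing(0xFE0F)); // true  (variation selector-16, Mn)
console.log(isContinuing(0x20E3)); // true  (combining enclosing keycap, Me)
console.log(isContinuing(0x0041)); // false (Latin capital letter A)
```

    U+FE0F and U+20E3 are the two code points that follow U+0033 in the "keycap three" sequence from the question.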

    If you need to roll your own, you can get the required information from the Unicode Character Database: https://www.unicode.org/ucd/

    The latest file is here: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

    The General Category is the 3rd semicolon-separated field of each line, and the combining marks are the entries whose category starts with M.
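
    Extracting the marks from that file could look roughly like the sketch below. The function name combiningMarksFrom is hypothetical, and the three sample lines are written in the file's format for illustration rather than read from the real file.

```javascript
// Hypothetical helper: collect the code points whose General Category starts with "M".
// Input is the text of UnicodeData.txt: one code point per line, fields separated by ";".
function combiningMarksFrom(unicodeData) {
    const marks = new Set();
    for (const line of unicodeData.split("\n")) {
        if (line.trim() === "") continue;      // skip blank lines
        const fields = line.split(";");
        const category = fields[2];            // 3rd field: General Category
        if (category !== undefined && category.startsWith("M")) {
            marks.add(parseInt(fields[0], 16)); // 1st field: code point in hex
        }
    }
    return marks;
}

// Sample lines in the UnicodeData.txt format (illustration only).
const sample = [
    "003A;COLON;Po;0;CS;;;;;N;;;;;",
    "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;",
    "20E3;COMBINING ENCLOSING KEYCAP;Me;0;NSM;;;;;N;;;;;",
].join("\n");

const marks = combiningMarksFrom(sample);
console.log(marks.has(0x0301), marks.has(0x20E3), marks.has(0x003A));
// true true false
```

    Building the set once at startup keeps the per-byte check in the parser down to a set lookup.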

    Most of the combining marks are in ranges dedicated to combining marks; you should check these first.
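
    Putting the pieces together at the byte level: in UTF-8, bytes 0x00 to 0x7F only ever encode ASCII code points (every byte of a multi-byte sequence is 0x80 or above), so a matching byte is always a real ASCII code point and only the code point after it needs inspecting. The sketch below assumes well-formed UTF-8 and an ES2018+ runtime; all names are made up for this example, and the range table (the dedicated combining blocks plus the variation selectors) is this example's choice of fast path, with a general category check as fallback.

```javascript
// Fast path: blocks dedicated to combining marks, plus variation selectors (all Mn).
// Unassigned code points inside these blocks are treated as continuing, which is
// a conservative simplification.
const MARK_BLOCKS = [
    [0x0300, 0x036F], // Combining Diacritical Marks
    [0x1AB0, 0x1AFF], // Combining Diacritical Marks Extended
    [0x1DC0, 0x1DFF], // Combining Diacritical Marks Supplement
    [0x20D0, 0x20FF], // Combining Diacritical Marks for Symbols
    [0xFE00, 0xFE0F], // Variation Selectors
    [0xFE20, 0xFE2F], // Combining Half Marks
];

// Decode the code point starting at index i; returns -1 at end of input.
// Assumption: the input is well-formed UTF-8.
function codePointAt(bytes, i) {
    const b = bytes[i];
    if (b === undefined) return -1;
    if (b < 0x80) return b; // single-byte (ASCII)
    const len = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
    let cp = b & (0xFF >> (len + 1)); // payload bits of the lead byte
    for (let k = 1; k < len; k++) {
        cp = (cp << 6) | (bytes[i + k] & 0x3F); // 6 payload bits per continuation byte
    }
    return cp;
}

function isContinuing(cp) {
    if (cp === 0x200C || cp === 0x200D) return true; // ZWNJ, ZWJ
    if (MARK_BLOCKS.some(([lo, hi]) => cp >= lo && cp <= hi)) return true;
    // Fallback for marks outside the dedicated blocks; ASCII is never a mark.
    return cp >= 0x80 && /^\p{M}$/u.test(String.fromCodePoint(cp));
}

// Indexes of delimiter bytes that are not followed by a continuing character.
function standaloneIndexes(bytes, delimiter) {
    const hits = [];
    for (let i = 0; i < bytes.length; i++) {
        if (bytes[i] === delimiter && !isContinuing(codePointAt(bytes, i + 1))) {
            hits.push(i);
        }
    }
    return hits;
}

// "3", ":", " ", keycap three (U+0033 U+FE0F U+20E3), " ", "3"
const bytes = new TextEncoder().encode("3: 3\uFE0F\u20E3 3");
console.log(standaloneIndexes(bytes, 0x33)); // [0, 11]
```

    The digit at byte index 3 is skipped because the next code point is U+FE0F; the digits at indexes 0 and 11 are reported as stand-alone delimiters.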