Reading bytes from a UTF-8 file (i.e. not processing strings), I need to unambiguously identify certain ASCII characters which are used as delimiters - much like CSV parsing.
While this seems simple at first, I'm afraid of false positives due to combining characters: Upon encountering a known character (e.g. colon; U+003A), how can I be sure it's not part of a larger sequence (grapheme cluster)?
For example, 3️⃣ ("keycap three") appears to be a composition of multiple bytes, starting with ASCII digit three (U+0033) - so if my delimiter happened to be U+0033, I'd have to make sure any such byte is a stand-alone ASCII character instead of a composition of multiple bytes.
Is this kind of distinction even possible without relying on complex libraries?
JavaScript example to illustrate processing (though I'm not interested in JavaScript per se here):
const LF = "\n".charCodeAt(0); // 10 (U+000A)
const SPACE = " ".charCodeAt(0); // 32 (U+0020)
const COLON = ":".charCodeAt(0); // 58 (U+003A)
const content = "1: 2\n3 4";
const bytes = new TextEncoder().encode(content);
console.log(bytes);
// Uint8Array(8) [49, 58, 32, 50, 10, 51, 32, 52]
const parts = Array.from(bytes).map(byte => {
    switch (byte) {
    case LF:
        return "\n";
    case SPACE:
        return " ";
    case COLON:
        return ":";
    default:
        return byte;
    }
});
console.log(parts);
// [49, ':', ' ', 50, '\n', 51, ' ', 52]
Unicode Standard Annex #29 gives the rules for breaking between "user-perceived characters" in section 3: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Taken together the rules are complicated, but the ones that apply to your specific use case are simple:
A regular ASCII letter or punctuation character is "standalone" unless it is followed by a "continuing character".
A "continuing character" is either a Combining Mark or one of the join control characters U+200C (zero width non-joiner) or U+200D (zero width joiner).
"Combining Mark" is the Unicode General Category M, which covers Mn, Mc and Me: https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
In most languages there's a readily available library with a function to see if a code point is a combining mark.
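As a minimal sketch of that rule in JavaScript, the built-in regex property escape `\p{M}` can stand in for such a library function. The helper names `codePointAt`, `isContinuing` and `isStandaloneAscii` are made up for this illustration, and the decoder assumes the input is well-formed UTF-8 (it does no validation). Note that in UTF-8 a byte below 0x80 is always a complete ASCII code point and never part of a multi-byte sequence, so only the code point that follows it needs to be inspected.

```javascript
// Matches a single code point in General Category M (Mn, Mc or Me).
const COMBINING_MARK = /^\p{M}$/u;

// Decode the code point that starts at `index`, or return null at end of input.
// Assumption: `bytes` is well-formed UTF-8 and `index` is at a sequence start.
function codePointAt(bytes, index) {
    if (index >= bytes.length) return null;
    const b0 = bytes[index];
    if (b0 < 0x80) return b0; // one byte (ASCII)
    if (b0 < 0xe0) { // two bytes
        return ((b0 & 0x1f) << 6) | (bytes[index + 1] & 0x3f);
    }
    if (b0 < 0xf0) { // three bytes
        return ((b0 & 0x0f) << 12) |
            ((bytes[index + 1] & 0x3f) << 6) |
            (bytes[index + 2] & 0x3f);
    }
    // four bytes
    return ((b0 & 0x07) << 18) |
        ((bytes[index + 1] & 0x3f) << 12) |
        ((bytes[index + 2] & 0x3f) << 6) |
        (bytes[index + 3] & 0x3f);
}

// "Continuing character" as described above: a combining mark or U+200C / U+200D.
function isContinuing(codePoint) {
    if (codePoint === null) return false;
    if (codePoint === 0x200c || codePoint === 0x200d) return true;
    return COMBINING_MARK.test(String.fromCodePoint(codePoint));
}

// True if the byte at `index` is ASCII and is not followed by a continuing character.
function isStandaloneAscii(bytes, index) {
    return bytes[index] < 0x80 && !isContinuing(codePointAt(bytes, index + 1));
}

// Keycap three is U+0033 U+FE0F U+20E3; here it is followed by a colon.
const keycapBytes = new TextEncoder().encode("3\uFE0F\u20E3:");
console.log(isStandaloneAscii(keycapBytes, 0)); // false: U+FE0F is a combining mark
console.log(isStandaloneAscii(keycapBytes, 7)); // true: the colon ends the input
```

This only looks one code point ahead, which matches the simplified rule above; it is not a full implementation of the annex's grapheme cluster rules.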
If you need to roll your own, you can get the required information from the Unicode Character Database: https://www.unicode.org/ucd/
The latest file is here: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
The category is the 3rd semicolon-separated field, and the combining marks are the entries whose category starts with M (Mn, Mc or Me).
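A rough sketch of extracting those entries, assuming the file's text has already been loaded into a string. The function name `combiningMarksFrom` is hypothetical, and the sample lines are trimmed to their first three fields for brevity (real lines carry more fields after the category).

```javascript
// Collect the code points whose General Category (3rd field) starts with "M".
// `text` is the content of UnicodeData.txt, or an excerpt in the same format.
function combiningMarksFrom(text) {
    const marks = new Set();
    for (const line of text.split("\n")) {
        if (line.trim() === "") continue; // skip blank lines
        const fields = line.split(";");
        const category = fields[2] || "";
        if (category.startsWith("M")) {
            marks.add(parseInt(fields[0], 16)); // 1st field is the hex code point
        }
    }
    return marks;
}

// Sample lines, trimmed to code point, name and category.
const sample = [
    "003A;COLON;Po",
    "0301;COMBINING ACUTE ACCENT;Mn",
    "20E3;COMBINING ENCLOSING KEYCAP;Me",
].join("\n");

const marks = combiningMarksFrom(sample);
console.log(marks.has(0x0301)); // true
console.log(marks.has(0x003a)); // false
```

A `Set` keeps the lookup simple; for a production table you would probably collapse the result into sorted ranges instead.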
Most of the combining marks are in ranges dedicated to combining marks, so you should check these first.