unicode hebrew unicode-string unicode-normalization

Unicode Composition of Hebrew Characters in JavaScript


Question: judging from this list, am I understanding it correctly that the two Hebrew characters bet (U+05D1) and dagesh (U+05BC) cannot be normalized/composed into bet with dagesh (U+FB31)?

Context: I know that normalization orders the characters of Hebrew text in a way that is not typically suited to historical linguistics. I have a package that sequences the characters in the preferred order, but I would like to be able to recompose them:

const sequenced = '\u05D1\u05BC\u05B8'; // bet + dagesh + qamets: the preferred sequencing
const presentationForm = '\uFB31\u05B8'; // bet with dagesh + qamets
if (sequenced.normalize('NFC') === presentationForm) {
    console.log('Want these two to match...');
}


Solution

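  • Your understanding is correct. Certain characters, including bet with dagesh (U+FB31), are on Unicode's composition exclusion list, so NFC never (re)composes bet (U+05D1) + dagesh (U+05BC) into them. In this case, the decomposed sequence is always the canonical form.

    This doesn't mean that you can't use the composed code point, but it won't survive any form of normalization (NFC, NFD, NFKC, or NFKD).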
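If the presentation form is still wanted, one option is to skip normalization for that step and substitute the pair directly. The following is only a minimal sketch: `composeBetDagesh` is a hypothetical helper, not a built-in or part of any package mentioned above, and it assumes the dagesh immediately follows the bet, as in the preferred sequencing.

```javascript
// Code points are written as escapes so the sequences are unambiguous.
const BET = '\u05D1';
const DAGESH = '\u05BC';
const QAMETS = '\u05B8';
const BET_WITH_DAGESH = '\uFB31';

// Hypothetical helper: recompose by plain substitution instead of NFC.
// Only the bet + dagesh pair is handled; other Hebrew presentation forms
// would each need their own entry.
function composeBetDagesh(text) {
  return text.split(BET + DAGESH).join(BET_WITH_DAGESH);
}

const sequenced = BET + DAGESH + QAMETS;
const presentationForm = BET_WITH_DAGESH + QAMETS;

// NFC does not produce the presentation form...
console.log(sequenced.normalize('NFC') === presentationForm); // false
// ...and it decomposes U+FB31 when it meets it.
console.log(BET_WITH_DAGESH.normalize('NFC') === BET + DAGESH); // true
// Direct substitution gives the match the question asks for.
console.log(composeBetDagesh(sequenced) === presentationForm); // true
```

The substitution has to run on the sequenced text, not on NFC output, because NFC places the qamets before the dagesh and the pair is then no longer adjacent. The result also has to be kept away from any later `normalize()` call, which would decompose it again.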