I am looking for a way to count ligatures as single units, the way they are displayed to the user, e.g. U+FEFB (https://www.compart.com/en/unicode/U+FEFB).
When this character is typed (type G on an Arabic keyboard), it is inserted in its decomposed form, i.e. U+0644 U+0627.
I'm able to decompose U+FEFB with:
escape(String.fromCodePoint(0xFEFB).normalize("NFKD")) // '%u0644%u0627'
Is there a way to compose U+0644 U+0627 into U+FEFB?
Why doesn't this work?
escape(String.fromCodePoint(0x0644, 0x0627).normalize("NFKC")) // '%u0644%u0627', not '%uFEFB'
The only idea I had was to iterate over the Unicode ranges I'm interested in, decompose each character, and build a map, but I'm hoping there's a better way.
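For reference, the map idea could be sketched roughly as follows. This is only an illustration: `buildCompositionMap` is a hypothetical helper name, and the range U+FE70 to U+FEFF (Arabic Presentation Forms-B) is an assumption about which characters are of interest. Several presentation forms (isolated, final, and so on) share one decomposition, so the sketch keeps the first one it encounters.

```javascript
// Hypothetical helper: build a reverse lookup from an NFKD decomposition
// to a presentation-form character within [start, end].
function buildCompositionMap(start, end) {
  const map = new Map();
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    const decomposed = ch.normalize("NFKD");
    // Unassigned code points and characters without a decomposition
    // normalize to themselves, so there is nothing to map back.
    if (decomposed === ch) continue;
    // Isolated, final, etc. forms decompose identically; keep the first seen.
    if (!map.has(decomposed)) map.set(decomposed, ch);
  }
  return map;
}

// Assumed range of interest: Arabic Presentation Forms-B.
const compositionMap = buildCompositionMap(0xFE70, 0xFEFF);
const composed = compositionMap.get("\u0644\u0627");
console.log(composed.codePointAt(0).toString(16)); // 'fefb'
```

Because the loop runs in ascending order, the isolated form U+FEFB is found before the final form U+FEFC, which has the same decomposition.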
Given that the ES2019 spec requires the implementation to:
Let ns be the String value that is the result of normalizing S into the normalization form named by f as specified in https://unicode.org/reports/tr15/.
and given that https://www.unicode.org/Public/12.1.0/ucd/NormalizationTest.txt describes that character as
FEFB;FEFB;FEFB;0644 0627;0644 0627; # (ﻻ; ﻻ; ﻻ; لا; لا; ) ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
this is the compliant behaviour. The five fields of that line are c1 to c5, and the header of the same file states:
# 1. The following invariants must be true for all conformant implementations
#
# NFC
# c2 == toNFC(c1) == toNFC(c2) == toNFC(c3)
# c4 == toNFC(c4) == toNFC(c5)
#
# NFD
# c3 == toNFD(c1) == toNFD(c2) == toNFD(c3)
# c5 == toNFD(c4) == toNFD(c5)
#
# NFKC
# c4 == toNFKC(c1) == toNFKC(c2) == toNFKC(c3) == toNFKC(c4) == toNFKC(c5)
#
# NFKD
# c5 == toNFKD(c1) == toNFKD(c2) == toNFKD(c3) == toNFKD(c4) == toNFKD(c5)
No normalisation form converts either the c4 or the c5 form back to c1, c2, or c3.
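The invariants can be checked directly in a JavaScript engine. The following is a minimal sketch; `toCodePoints` and `results` are names made up for the illustration, and `c1` and `c4` are taken from the NormalizationTest.txt line quoted earlier.

```javascript
// Columns c1 and c4 from the NormalizationTest.txt line for U+FEFB.
const c1 = "\uFEFB";       // ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
const c4 = "\u0644\u0627"; // LAM followed by ALEF

// Format a string as a list of U+XXXX code points for readable output.
function toCodePoints(s) {
  return Array.from(s, ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  ).join(" ");
}

// Apply every normalization form to both the ligature and the sequence.
const results = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  results[form] = {
    fromLigature: toCodePoints(c1.normalize(form)),
    fromSequence: toCodePoints(c4.normalize(form)),
  };
}
console.log(results);
```

The ligature stays U+FEFB under NFC and NFD and becomes U+0644 U+0627 under NFKC and NFKD, while the two-character sequence is unchanged by all four forms.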
So, in my Unicode-amateur opinion, there is no standard-compliant way to normalise U+0644 U+0627 back to U+FEFB.