
How to handle Arabic Unicode U+06A8 vs. U+08C4 and U+08BC? Documentation unclear


Slightly similar question: difference between U+06A4 and U+06A8? (ARABIC LETTER VEH and ARABIC LETTER QAF WITH THREE DOTS ABOVE)

I am writing a script that reads Arabic Unicode code points and, depending on whether each letter appears in its initial, medial, final, or isolated form, returns the correct code point. For many Arabic characters this is straightforward (see Arabic Presentation Forms-A). Take, for example, 067B ﭒ: the chart gives me each of its forms directly, along with the hexadecimal code point for that form. If I read this character from an input stream, I can tell from the joining characteristics of the characters to its left and right which glyph the letter should take: initial, medial, final, or isolated. Several letters confuse me, however.

My question is specifically for the characters:

08BC ࢼ ARABIC LETTER AFRICAN QAF
08C4 ࣄ ARABIC LETTER AFRICAN QAF WITH THREE DOTS ABOVE

Found here: https://www.unicode.org/charts/PDF/U08A0.pdf

and U+06A8 Found here: https://www.unicode.org/charts/PDF/U0600.pdf

U+08BC and U+08C4 do not have their presentation forms listed explicitly in the Arabic Presentation Forms-A and -B documents. Instead, the forms are described inline in these other charts, and I don't fully understand what the notes are trying to say for U+08BC. Are they saying that U+06A7 should be used for its initial and medial forms, but U+066F for its other forms? If so, what does U+08C4 have to do with this letter? Why is it in the notes?

Secondly, I don't understand what the notes for U+08C4 are saying. What does "this letter shows" mean? Is the glyph provided already showing one dot more than the standard two dots above in U+0642? Which two-dotted code point would then produce U+08C4 in its initial and medial forms? Or is it saying that another dot needs to be added when U+08C4 occurs in initial or medial position, for a total of four dots above? How would one do that? And what does U+08BC have to do with this letter in the notes?

Third and final question: how is U+08C4 different from U+06A8? My guess is that their presentation forms differ, so they need different code points. I just want clarification here.

Thank you in advance


Solution

  • You may be misunderstanding the Arabic Presentation Forms. These exist mostly for backward compatibility, and they are occasionally useful on their own, but they are not intended as a general-purpose tool for displaying Arabic letter forms. As noted in the spec (page 387):

    Optional Features. Many other ligatures and contextual forms are optional, depending on the font and application. Some of these presentation forms are encoded in the ranges U+FB50..U+FDFF and U+FE70..U+FEFE. However, these forms should not be used in general interchange. Moreover, it is not expected that every Arabic font will contain all of these forms, nor that these forms will include all presentation forms used by every font.

    If they happen to be useful to you, that's ok. But they're not particularly supported, and new ones will not be added. As noted in the Private-Use FAQ:

    The Arabic Presentation Forms-A block had a contiguous range of 32 unassigned code points, but as of 2001, when the need for more BMP noncharacters became apparent, it was already clear to the UTC that the encoding of many more Arabic presentation forms similar to those already in the Arabic Presentation Forms-A block would not be useful to anyone.

    Unicode is not meant to encode positional glyph variations. It is only intended to encode the "abstract character." Choosing the correct glyph based on context is the job of fonts.
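One way to see the "abstract character" point in practice is to look at how the legacy presentation forms relate to their base letter. The sketch below uses Python's standard `unicodedata` module and the U+067B example from the question; the names `BEEH`, `PRESENTATION_FORMS`, and `to_abstract` are made up for illustration. It assumes that NFKC normalization is acceptable for your data, which may not be true if the text contains other compatibility characters you want to preserve.

```python
import unicodedata

# U+067B ARABIC LETTER BEEH is the abstract character.
BEEH = "\u067B"

# Its four legacy presentation forms from Arabic Presentation Forms-A.
PRESENTATION_FORMS = {
    "isolated": "\uFB52",
    "final": "\uFB53",
    "initial": "\uFB54",
    "medial": "\uFB55",
}


def to_abstract(text: str) -> str:
    """Fold presentation forms back to their base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


for position, form in PRESENTATION_FORMS.items():
    # Each form carries a compatibility decomposition tagged with its
    # position that points back at the single base letter.
    print(position, unicodedata.decomposition(form), to_abstract(form) == BEEH)
```

All four forms fold back to the same code point, which is the one that should be stored and interchanged. Letters such as U+08BC have no presentation-form code points at all, so there is nothing to map them to; the font picks the glyph.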

    Regarding U+08BC, the chart notes are hints for the writers of rendering engines and fonts. U+08BC should always be encoded as U+08BC, regardless of position. But it can be useful to know where glyphs can be reused.

    So to your underlying goal:

    I am writing a script to handle different arabic unicode points based on if they are initial, medial, final, or isolated forms and returning the correct unicode point.

    As a rule, the correct code point will not change based on position. Only glyph selection or rendering adjustments will apply. If you're trying to cause certain forms to appear, consider using U+0640 (tatweel) to create the needed "joins."
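The tatweel suggestion can be sketched roughly as follows. This is a minimal illustration, not a complete shaping solution: the helper `force_form` is hypothetical, it assumes the letter is dual-joining (as qaf-type letters are), and whether the requested shape actually appears still depends on the font and rendering engine. Text is stored in logical order, so a tatweel placed after the letter asks for a join on its following side.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, a joining extender


def force_form(letter: str, form: str) -> str:
    """Surround a letter with tatweel so a renderer shows the given form.

    The letter's own code point is never changed; only its neighbours are.
    """
    if form == "isolated":
        return letter
    if form == "initial":
        return letter + TATWEEL  # joins only to what follows
    if form == "medial":
        return TATWEEL + letter + TATWEEL  # joins on both sides
    if form == "final":
        return TATWEEL + letter  # joins only to what precedes
    raise ValueError(f"unknown form: {form!r}")


AFRICAN_QAF = "\u08BC"
for form in ("isolated", "initial", "medial", "final"):
    sample = force_form(AFRICAN_QAF, form)
    print(form, [f"U+{ord(ch):04X}" for ch in sample])
```

In every case the sequence still contains U+08BC itself, which matches the rule above: the position changes the glyph, not the code point.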