
Telugu Anu Script Text


An Indian-language script loses its characters when copied and pasted into browsers

I need to understand the character types involved and how to convert them to formats that are supported elsewhere. I have text that was typed using Anu Script Software with an Apple keyboard. Text typed with Anu cannot be used as input in any browser, nor in WhatsApp Web.

Can anyone solve this?

The copied and pasted text displays like this:

The real text is shown in the screenshot below:

This image shows one language of India, typed using Anu Script Software



Solution

  • The character codes that were copied and pasted into the question are Unicode code points in the Unicode BMP (Basic Multilingual Plane) Private Use Area (PUA). The distinct points are:

    If you go to the Unicode Charts page and enter 'F020' as the code, it gives you UE000.pdf to download, which says:

    Private Use Area

    Range: E000-F8FF

    The Private Use Area does not contain any character assignments, consequently no character code charts or names lists are provided for this area.

    What this means is that the Anu Script Software is using Unicode code points that have no internationally agreed meaning. The BMP PUA is, by definition, for 'private use', and the parties sharing data using the PUA must agree on what the code points mean and how to display them. The code points only work with software that understands that convention, so you cannot use them except with software that understands what Anu Script Software does.
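
To check whether a pasted string is affected, one option is to look for code points whose Unicode general category is `Co` (private use). This is a minimal sketch in Python; the helper names `is_private_use` and `private_use_report` are made up for illustration, and the sample string is just the first few code points from the pasted text.

```python
import unicodedata

def is_private_use(ch):
    """True if ch is a private-use code point (general category 'Co')."""
    return unicodedata.category(ch) == "Co"

def private_use_report(text):
    """List the distinct private-use code points in text, in first-seen order."""
    seen = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        if is_private_use(ch) and label not in seen:
            seen.append(label)
    return seen

# First few code points from the pasted text (U+F020 appears twice).
sample = "\uf087\uf020\uf080\uf05c\uf020"
print(private_use_report(sample))  # ['U+F087', 'U+F020', 'U+F080', 'U+F05C']
```

If the report is non-empty, the text depends on a private convention and will not render meaningfully without the matching font.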

    Browsers will only display those code points correctly if they're made aware of where the relevant font is, which gets into intricate details and is probably platform-specific. (I've no idea where to start!)

    The standard Unicode range for Telugu is U+0C00..U+0C7F.

    Telugu

    Range: 0C00–0C7F
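
By contrast, code points in the standard Telugu block carry agreed names that any Unicode-aware software can look up. A small Python sketch, where `in_telugu_block` is a hypothetical helper and the four letters are ones that appear in the table further down:

```python
import unicodedata

TELUGU_FIRST, TELUGU_LAST = 0x0C00, 0x0C7F

def in_telugu_block(ch):
    """True if ch lies in the Unicode Telugu block U+0C00..U+0C7F."""
    return TELUGU_FIRST <= ord(ch) <= TELUGU_LAST

# U+0C08, U+0C06, U+0C32, U+0C2F written as escapes to avoid font issues.
for ch in "\u0c08\u0c06\u0c32\u0c2f":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), in_telugu_block(ch))
```

The same lookup on a PUA code point such as U+F087 raises `ValueError`, because private-use code points have no standard name.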

    Your best bet is probably to analyze the similarities and differences between the code points used by Anu Script Software and the Unicode standard range for Telugu, and then use the Unicode standard codes. You might need to understand combining accents and various other aspects of Telugu.


    I don't know Telugu at all, so what follows may be inaccurate, but I think it more or less makes sense of what's in the Anu Script Software output:

    UTF-8 bytes      PUA        Telugu  Glyph
    0xEF 0x82 0x87 = U+F087 ==> U+0C08  ఈ
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x82 0x80 = U+F080 ==> U+0C06  ఆ
    0xEF 0x81 0x9C = U+F05C ==> U+0C32  ల
    0xEF 0x81 0xAA = U+F06A \
    0xEF 0x83 0xA1 = U+F0E1 ==> U+0C2F  య  (three code points for one character)
    0xEF 0x81 0x94 = U+F054 /
    0xEF 0x80 0xAB = U+F02B ==> U+0C66  ౦
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x83 0x82 = U+F0C2 
    0xEF 0x81 0xB3 = U+F073
    0xEF 0x80 0xAB = U+F02B
    0xEF 0x80 0xA6 = U+F026
    0xEF 0x82 0x83 = U+F083
    0xEF 0x81 0x94 = U+F054
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x80 0xBC = U+F03C
    0xEF 0x82 0x8A = U+F08A
    0xEF 0x81 0x98 = U+F058
    0xEF 0x83 0xA6 = U+F0E6
    0xEF 0x81 0xB5 = U+F075
    0xEF 0x82 0xB2 = U+F0B2
    0xEF 0x83 0x92 = U+F0D2
    0xEF 0x81 0x9C = U+F05C
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x83 0xA7 = U+F0E7 ==> U+0C46 U+0C66  ౦ె (Note 1)
    0xEF 0x82 0xBF = U+F0BF
    0xEF 0x83 0xAC = U+F0EC
    0xEF 0x83 0x94 = U+F0D4
    0xEF 0x83 0xA1 = U+F0E1
    0xEF 0x80 0xAB = U+F02B
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x81 0xB3 = U+F073
    0xEF 0x82 0x90 = U+F090
    0xEF 0x83 0xA7 = U+F0E7
    0xEF 0x81 0xB7 = U+F077
    0xEF 0x82 0x9F = U+F09F
    0xEF 0x82 0xBC = U+F0BC
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x80 0xBC = U+F03C
    0xEF 0x83 0xBB = U+F0FB
    0xEF 0x81 0xB9 = U+F079
    0xEF 0x82 0x90 = U+F090
    0xEF 0x80 0xBC = U+F03C
    0xEF 0x82 0x91 = U+F091
    0xEF 0x81 0xAA = U+F06A
    0xEF 0x83 0xA1 = U+F0E1
    0xEF 0x81 0x94 = U+F054
    0xEF 0x80 0xA0 = U+F020 ==> U+0020  space
    0xEF 0x80 0xBC = U+F03C
    0xEF 0x82 0x8A = U+F08A
    0xEF 0x81 0xB3 = U+F073
    0xEF 0x82 0x90 = U+F090
    0xEF 0x82 0x88 = U+F088
    0xEF 0x80 0xBC = U+F03C
    0xEF 0x82 0x91 = U+F091
    0xEF 0x81 0xAA = U+F06A \
    0xEF 0x83 0xA1 = U+F0E1 ==> U+0C2F  య
    0xEF 0x81 0x94 = U+F054 /
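
The first two columns of the table can be reproduced mechanically: decode the UTF-8 bytes and print each resulting code point. A minimal Python sketch (the function name `code_points` is just for illustration):

```python
def code_points(raw):
    """Decode UTF-8 bytes and return each character as a 'U+XXXX' label."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode("utf-8")]

# The first three entries from the table: 9 bytes, 3 code points.
raw = bytes([0xEF, 0x82, 0x87, 0xEF, 0x80, 0xA0, 0xEF, 0x82, 0x80])
print(code_points(raw))  # ['U+F087', 'U+F020', 'U+F080']
```

Every BMP PUA code point takes three bytes in UTF-8, which is why each row of the table starts with 0xEF.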
    

    Note 1: The TELUGU VOWEL SIGN E U+0C46 should combine with TELUGU DIGIT ZERO U+0C66 — if I've identified the characters correctly, which seems improbable. I will leave off trying here; I recognize some shapes by matching what you show in the image with the Unicode chart page, but I'm not confident of the mapping to the PUA code points.

    You should be able to get appropriate information from the people who provided the Anu Script Software.
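
Once the real mapping is known, the conversion itself is mechanical. The sketch below is Python and uses only the tentative mappings from the table above, which may well be wrong; `PUA_TO_UNICODE` and `convert` are hypothetical names, and the real table would have to come from the Anu Script Software documentation. It tries longer sequences first because one Telugu character can be spread over several PUA code points.

```python
# Tentative PUA-to-Unicode mappings copied from the table above. They are
# guesses and must be checked against Anu Script Software's actual encoding.
PUA_TO_UNICODE = {
    "\uf087": "\u0c08",              # TELUGU LETTER II
    "\uf020": " ",                   # space
    "\uf080": "\u0c06",              # TELUGU LETTER AA
    "\uf05c": "\u0c32",              # TELUGU LETTER LA
    "\uf06a\uf0e1\uf054": "\u0c2f",  # three PUA code points -> TELUGU LETTER YA
    "\uf02b": "\u0c66",              # TELUGU DIGIT ZERO
}

def convert(text):
    """Greedy longest-match replacement; unmapped characters pass through unchanged."""
    keys = sorted(PUA_TO_UNICODE, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(PUA_TO_UNICODE[key])
                i += len(key)
                break
        else:
            # No mapping known for this code point: keep it as-is.
            out.append(text[i])
            i += 1
    return "".join(out)

# The first word group from the pasted text.
pasted = "\uf087\uf020\uf080\uf05c\uf06a\uf0e1\uf054\uf02b"
print(convert(pasted))
```

Leaving unmapped code points untouched makes it easy to see which parts of the mapping are still missing.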