pdf copy-paste paste hindi embedded-fonts

Copy-pasting Hindi from a correctly rendering PDF pastes only certain characters correctly


I have a PDF that renders correctly ([0]). If I try to copy-paste from it, the pasted words are a little off. It doesn't happen with all of the text, only some of the words. The font (Devanagari MT) is embedded in the PDF; in one case I have the exact same font installed and yet it still does not paste correctly.

I have attached an image to better illustrate what I am seeing. It's a bit busy so I'll break it down: the left side/background is the PDF open in Adobe Acrobat DC Reader on MacOS Mavericks (10.10); the right side is the text pasted into Notes (top) and Pages (bottom). The red rectangles show that the font being used is common among all examples; the green rectangles outline the part of the word that is correctly duplicated. Highlighted text represents the complete word being either copied or pasted.

In addition to Adobe Acrobat DC Reader I have also tried to copy from Preview (the MacOS default for viewing PDFs). Outside of the picture (i.e. terminal, browser, random text boxes, etc.) and in Windows 10 I am getting identical results. Conversion to e.g. RTF or .docx also yields the same issue.

What is missing or misconfigured? How do I solve this so I can reliably copy and paste? Thank you in advance for your ideas and insight.

Kind regards,

-jayce

[0] https://repositories.lib.utexas.edu/bitstream/handle/2152/41433/GlossariesAlive_01.pdf

EDIT: Mixed up Pages and Acrobat DC

[Image: hindi borked example]


Solution

  • The character code used for the text in a PDF file need not have any direct relationship with any language coding. Here's what the PDF contains for the bit of text you are pointing at:

    /F1.0 1 Tf (these houses ) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 235.8 375
    Tm /F2.1 1 Tf (7) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 242.2346 375 Tm
    /F1.0 1 Tf ( ) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 244.9846 375 Tm /F2.1
    1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 235.8 406
    

    Now Tf selects a font (and point size), and Tj draws text. BT and ET mean Begin Text Block and End Text Block, q and Q mean gsave and grestore, cm is concatmatrix, Tm is set text matrix, and TJ is another way to draw text.

    You can ignore most of these.

    Looking at just the important bits we have:

    /F1.0 1 Tf (these houses ) Tj
    /F2.1 1 Tf (7) Tj 
    /F1.0 1 Tf ( ) Tj
    /F2.1 1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ
    
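
    The reduction above can be sketched in a few lines of Python. This is only a toy tokenizer for this particular snippet: the names STREAM, TOKEN, STRING and text_runs are made up for illustration, and it assumes simple literal strings with no escapes, nested parentheses or hex strings, which a real PDF parser would have to handle.

```python
import re

# The simplified content stream from above, as one string.
STREAM = r'''
/F1.0 1 Tf (these houses ) Tj
/F2.1 1 Tf (7) Tj
/F1.0 1 Tf ( ) Tj
/F2.1 1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ
'''

# Matches a font selection (Tf), a single-string Tj, or an array TJ.
TOKEN = re.compile(
    r'/(?P<font>\S+)\s+[\d.]+\s+Tf'
    r'|\((?P<tj>[^()]*)\)\s*Tj'
    r'|\[(?P<arr>[^\]]*)\]\s*TJ'
)
# Pulls the literal strings out of a TJ array, skipping the kerning numbers.
STRING = re.compile(r'\(([^()]*)\)')


def text_runs(stream):
    """Return (font name, raw string) pairs in drawing order."""
    runs = []
    font = None
    for m in TOKEN.finditer(stream):
        if m.group('font'):
            font = m.group('font')
        elif m.group('tj') is not None:
            runs.append((font, m.group('tj')))
        else:
            runs.append((font, ''.join(STRING.findall(m.group('arr')))))
    return runs


for font, raw in text_runs(STREAM):
    print(font, repr(raw))
```

    The raw strings it prints for F2.1 are character codes, not readable text, which is the point of the rest of this answer.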

    Now you can see that the text in the font named 'F1.0' is encoded using ASCII (more or less). This font is AGaramondPro-Regular, using MacRomanEncoding:

    8 0 obj
    <<
      /Type /Font
      /Subtype /Type1
      /BaseFont /GFJJBF+AGaramondPro-Regular
      /FontDescriptor 54 0 R
      /Widths 55 0 R
      /FirstChar 32
      /LastChar 169
      /Encoding /MacRomanEncoding
    >>
    endobj
    

    The text using font 'F2.1' is your Devanagari font, defined as:

    10 0 obj
    <<
      /Type /Font
      /Subtype /TrueType
      /BaseFont /MWSGSJ+DevanagariMT
      /FontDescriptor 48 0 R
      /Widths 49 0 R
      /FirstChar 33
      /LastChar 105
      /ToUnicode 50 0 R
    >>
    endobj
    

    Notice this has no Encoding, but it does have a ToUnicode entry. Essentially this means the font has a non-standard custom Encoding. The subset font is defined in such a way that the character code maps directly to a specific glyph in the font's GLYF table (it's a TrueType font). Because it's not a standard Encoding, there's no way to know what the character codes mean. However, the ToUnicode CMap is intended to give you a mapping from character code to Unicode code point.

    The ToUnicode CMap is Acrobat's (and other viewers') first and best way to extract text. A properly constructed ToUnicode CMap should give you a direct Unicode code point from a given character code. The CMap in the file is:

    50 0 obj
    <<
      /Length 913
    >>
    stream
    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo <<
      /Registry (Adobe)
      /Ordering (UCS)
      /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <00><FF>
    endcodespacerange
    39 beginbfrange
    <21><21><092e>
    <22><22><0915>
    <23><23><093e>
    <24><24><0928>
    <25><25><092c>
    <26><26><095c>
    <27><27><0938>
    <2a><2a><0926>
    <2b><2b><0930>
    <2c><2c><091b>
    <2d><2d><094b>
    <2e><2e><091f>
    <2f><2f><090f>
    <32><32><0924>
    <33><33><0940>
    <34><34><092f>
    <35><35><0939>
    <36><36><0935>
    <39><39><0906>
    <3a><3a><0932>
    <3e><3e><092a>
    <46><46><0905>
    <49><49><095b>
    <4a><4a><095a>
    <4b><4b><091a>
    <51><51><0917>
    <52><52><091c>
    <58><58><0920>
    <5a><5b><095d>
    <5c><5c><0959>
    <5d><5d><0914>
    <60><60><0921>
    <61><61><094c>
    <62><62><092d>
    <63><63><0936>
    <64><64><093f>
    <65><65><0916>
    <66><66><0907>
    <68><68><0927>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    endstream
    endobj
    

    Taking the first line:

    <21><21><092e>

    That means the character codes from 0x21 to 0x21 map to Unicode code points starting at 0x092e. Obviously that's a single character code, but it could be a range.
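
    As a rough illustration of how such lines expand into a lookup table, here is a minimal Python sketch. The function name parse_bfrange is hypothetical, and it only handles the simple form seen in this file (three hex strings per line, one Unicode code point per code); real CMaps can also use bfchar sections, array destinations and multi-character destinations.

```python
import re

# One bfrange entry: <first code> <last code> <first Unicode code point>
BFRANGE_LINE = re.compile(
    r'<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>'
)


def parse_bfrange(text):
    """Expand simple bfrange entries into {character code: Unicode string}."""
    mapping = {}
    for lo, hi, dst in BFRANGE_LINE.findall(text):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        # Each code in the range maps to consecutive Unicode code points.
        for offset, code in enumerate(range(start, end + 1)):
            mapping[code] = chr(base + offset)
    return mapping


# Three entries copied from the CMap above; the last one is a real range.
sample = """
<21><21><092e>
<22><22><0915>
<5a><5b><095d>
"""
cmap = parse_bfrange(sample)
print({hex(code): hex(ord(ch)) for code, ch in cmap.items()})
```

    Note that the `<5a><5b><095d>` entry covers two codes, so 0x5b ends up at U+095E.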

    Now you'll note that the CMap has 'holes' in the ranges, for instance there are no entries for 0x28 and 0x29.

    So, taking your text, the characters are 7, !, ", #, $, or in hex 0x37, 0x21, 0x22, 0x23, 0x24. (You can see how the indices have been chosen: the first character in the file is 0x01, the second is 0x02 and so on, so the code-to-glyph mapping depends on the order in which the characters are used.)

    So we run those numbers through the ToUnicode CMap, 0x37 maps to... Oops! There is no entry in the CMap for character code 0x37! 0x21 maps to 0x092e, 0x22 to 0x0915, 0x23 to 0x093e and 0x24 maps to 0x0928.

    So the latter four characters copy and paste correctly. Acrobat (and any other viewer) doesn't know what to do with character code 0x37, so it does the best it can and falls back to good old ASCII in the hope that it might be right, which is why the initial pasted character is a 7, that's 0x37 in ASCII.
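
    The behaviour described here can be mimicked with a small Python sketch. The fallback rule is only a model of what this answer describes, not Acrobat's actual algorithm, and the names TO_UNICODE and extract_text are made up for the example.

```python
# The four ToUnicode entries that matter for this word, taken from the CMap.
TO_UNICODE = {
    0x21: '\u092e',
    0x22: '\u0915',
    0x23: '\u093e',
    0x24: '\u0928',
}


def extract_text(codes, to_unicode):
    """Map character codes to text, falling back to the raw code as ASCII."""
    out = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
        else:
            # Assumed fallback: treat the unmapped code as plain ASCII.
            out.append(chr(code))
    return ''.join(out)


# The codes drawn with font F2.1: 0x37, then 0x21 0x22 0x23 0x24.
pasted = extract_text(b'7!"#$', TO_UNICODE)
print(pasted)
```

    The first character comes out as a literal 7 because 0x37 has no entry, while the remaining four codes resolve to Devanagari code points.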

    So that's your problem: the ToUnicode CMap does not contain a mapping to Unicode code points for all the character codes which are used in the PDF file. This is a fault of the PDF creation tool, 'Mac OS X 10.6 Quartz PDFContext', or (since the file has been modified) the editing application, 'Pages'.
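
    If you want to see which codes are affected, a quick check along these lines would do it. This is a sketch, not a tool: MAPPED is typed in by hand from the CMap above, and unmapped_codes is a hypothetical helper name.

```python
# Every character code that has a ToUnicode entry in this file's CMap
# (the <5a><5b> range contributes two codes).
MAPPED = {
    0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2a, 0x2b, 0x2c,
    0x2d, 0x2e, 0x2f, 0x32, 0x33, 0x34, 0x35, 0x36, 0x39, 0x3a,
    0x3e, 0x46, 0x49, 0x4a, 0x4b, 0x51, 0x52, 0x58, 0x5a, 0x5b,
    0x5c, 0x5d, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x68,
}


def unmapped_codes(used, mapped):
    """Return the character codes that are used but have no Unicode mapping."""
    return sorted(set(used) - mapped)


# Codes drawn with the Devanagari font in the snippet discussed above.
missing = unmapped_codes(b'7!"#$', MAPPED)
print([hex(c) for c in missing])

# Codes inside the font's FirstChar..LastChar range (33..105) with no entry.
holes = unmapped_codes(range(33, 106), MAPPED)
print(len(holes), 'codes in the font range have no ToUnicode entry')
```

    Not every hole is necessarily used in the text, but every used code that lands in a hole will paste incorrectly.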

    How can you fix this? Well, you could hand-edit the ToUnicode CMap and add entries for each missing character code. That would be a laborious process, because first you'd have to identify each character code in the text and figure out what its Unicode code point is. Also, PDF is a binary format with a cross-reference table that records the byte offset of each object. If you make any insertions in the file then the xref table will be invalid and the PDF file effectively corrupted. Some viewers will be able to fix it, some won't.
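
    To see why an insertion breaks the cross-reference table, consider this toy Python sketch. The byte string is a made-up stand-in for a PDF, not a valid file, and object_offsets is a hypothetical helper; it only shows that adding bytes to one object shifts the offsets of everything after it.

```python
import re


def object_offsets(data):
    """Return {object number: byte offset} for each 'N 0 obj' line start."""
    return {
        int(m.group(1)): m.start()
        for m in re.finditer(rb'(?m)^(\d+) 0 obj', data)
    }


# A toy stand-in for a PDF body with two objects.
original = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Length 3 >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
recorded = object_offsets(original)  # what the xref table would hold

# A hand edit that makes object 1 three bytes longer.
edited = original.replace(b"/Length 3", b"/Length 3000")
actual = object_offsets(edited)

# Objects whose recorded offset no longer points at them.
stale = [n for n in recorded if recorded[n] != actual[n]]
print('recorded:', recorded)
print('actual:  ', actual)
print('stale xref entries for objects:', stale)
```

    Every object after the edit point ends up with a stale entry, so a hand edit would also need the xref table (and any stream /Length values) to be rewritten.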

    As I hinted above, a custom-encoded subset font is normally created so that the first character used in the document is given the character code 1, the second is 2 and so on. So the actual mapping is unique to each document, and it's not going to be possible to write some code to reliably do this for you, because there is no 'one size fits all' mapping.
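
    A small Python sketch of that numbering scheme shows why the mapping differs from file to file. It is an illustration under assumptions: assign_codes is a hypothetical function, it numbers characters rather than glyphs, and the starting code 0x21 is chosen only because this file's font has FirstChar 33.

```python
def assign_codes(text, first_code=0x21):
    """Give each new character the next free code, in order of first use."""
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = first_code + len(codes)
    return codes


# Two documents using the same letters in a different order.
doc_a = assign_codes('\u092e\u0915\u093e\u0928')
doc_b = assign_codes('\u0928\u092e\u0915')

print({hex(ord(ch)): hex(code) for ch, code in doc_a.items()})
print({hex(ord(ch)): hex(code) for ch, code in doc_b.items()})
```

    The first document happens to reproduce the 0x21 to 0x24 entries of the CMap above, while the second assigns 0x21 to a different letter, so a table built for one file is useless for another.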

    Basically you need to remake the PDF file using software which embeds a correct ToUnicode CMap in the PDF file.