
If both Encoding and ToUnicode are present for a font in a PDF, how do I map character codes during text extraction?


I used qpdf to uncompress a PDF file; the output is shown below. You can see that both Encoding and ToUnicode are present. If there were only a ToUnicode entry, I would know how to map individual characters using the CMap. But the content stream looks like this:

Tf
0.999402 0 0 1 71.9995 759.561 Tm
[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679(   )-3.75591()9.73429()]TJ

Inside the brackets there is some data that is not visible and looks like garbage. How do I link this data to the CMap?

Another question: what do the values in the Differences array of /Encoding mean?

10 0 obj
<< /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

If I pass the names from the Differences array one by one to the FreeType function FT_Get_Name_Index, it returns values like [ 100 28 94 3 87 24 38 47 62 ].

What are those values, and how do I map them?

Here is the PDF.

Run the following command:

qpdf --stream-data=uncompress input.pdf output.text

output.text

I get the same output if I pass the content stream data through zlib. Please check the linked output file.


Solution

  • Firstly the general question

    how to extract the text in pdf if encoding and ToUnicode both are present in pdf? how to map it?

    [...] you can see that encoding and ToUnicode are both present in the pdf. If only ToUnicode is there, I know how to map individual chars with the CMap file.

    In such a case, i.e. when you have both a sufficiently complete and correct ToUnicode map and an Encoding for a font, you can ignore the Encoding and only use the ToUnicode map.

    This follows from the PDF specification (ISO 32000-1), which in section 9.10.2 "Mapping Character Codes to Unicode Values" lists the following as the highest-priority method for mapping a character code to a Unicode value:

    If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

    Thus, if you already know (as you say) how to extract text when there is only a ToUnicode map, you can use the same algorithm unchanged. As a corollary, if that does not work, either the ToUnicode map in question is incomplete or incorrect, or your understanding of how to extract text using only a ToUnicode map is incomplete.
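The priority rule can be sketched in C++ as follows. This is only a minimal illustration for single-byte fonts: `SimpleFont`, `codeToUnicode`, `kUnknown`, and the `glyphNames` table (standing in for something like the Adobe Glyph List) are hypothetical names, not part of any library, and returning U+FFFD for unmapped codes is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily simplified model of a simple (single-byte) font.
struct SimpleFont {
    bool hasToUnicode = false;                       // font dictionary has /ToUnicode
    std::map<unsigned char, char32_t> toUnicode;     // entries of the ToUnicode CMap
    std::map<unsigned char, std::string> encoding;   // code -> glyph name from /Encoding
};

const char32_t kUnknown = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Map one character code to Unicode, giving the ToUnicode CMap priority.
// glyphNames is a glyph-name -> Unicode table supplied by the caller.
char32_t codeToUnicode(const SimpleFont& font, unsigned char code,
                       const std::map<std::string, char32_t>& glyphNames)
{
    if (font.hasToUnicode) {
        // Highest priority: use the ToUnicode CMap and ignore the Encoding.
        auto it = font.toUnicode.find(code);
        return it != font.toUnicode.end() ? it->second : kUnknown;
    }
    // No ToUnicode CMap: go from code to glyph name to Unicode.
    auto nameIt = font.encoding.find(code);
    if (nameIt == font.encoding.end()) return kUnknown;
    auto uniIt = glyphNames.find(nameIt->second);
    return uniIt != glyphNames.end() ? uniIt->second : kUnknown;
}
```

With a font that has both entries, code 1 is resolved through the ToUnicode entries alone, and a nonstandard glyph name such as g100 in the Encoding never comes into play.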

    Secondly the sample document

    You wrote

    [()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ

    in the brackets there is some garbage data that is not visible. so how to link the data to the cmap file ?

    The brackets contain the values identifying your glyphs, so they definitely are not garbage. They are not visible because they are non-printable bytes.

    Here are the byte values (in hexadecimal) from within the brackets:

    [(
        01
    )-2.11826(
        02
    )-1.14177(
        03
    )2.67786(
        01
    )-2.11826(
        04
    )8.55269(
        05
    )-5.44998(
        06
    )-4.70186(
        07
    )2.67786(
        04
    )-2.32338(
        07
    )2.67786(
        08
    )12.679(
        09
    )-3.75591(
        02
    )9.73429(
        04
    )]TJ
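As a rough sketch of how these bytes could be pulled out of the TJ operand in C++: the function name `tjStrings` is hypothetical, only literal strings in parentheses are handled, and hex strings (`<...>`) and backslash line continuations are deliberately left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collect the raw bytes of every literal string "(...)" in a TJ array operand.
// The numbers between the strings are positioning adjustments and are skipped.
std::vector<std::string> tjStrings(const std::string& arr)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < arr.size()) {
        if (arr[i++] != '(') continue;        // skip numbers, brackets, whitespace
        std::string s;
        int depth = 1;                        // balanced parentheses may nest
        while (i < arr.size() && depth > 0) {
            char c = arr[i++];
            if (c == '\\' && i < arr.size()) {
                char e = arr[i++];
                if (e >= '0' && e <= '7') {   // octal escape, up to three digits
                    int v = e - '0';
                    for (int k = 0; k < 2 && i < arr.size()
                                    && arr[i] >= '0' && arr[i] <= '7'; ++k)
                        v = v * 8 + (arr[i++] - '0');
                    s.push_back(static_cast<char>(v & 0xFF));
                } else if (e == 'n') s.push_back('\n');
                else if (e == 'r') s.push_back('\r');
                else if (e == 't') s.push_back('\t');
                else if (e == 'b') s.push_back('\b');
                else if (e == 'f') s.push_back('\f');
                else s.push_back(e);          // \( \) \\ : keep the escaped byte
            } else if (c == '(') {
                ++depth;
                s.push_back(c);
            } else if (c == ')') {
                if (--depth > 0) s.push_back(c);
            } else {
                s.push_back(c);
            }
        }
        out.push_back(s);
    }
    return out;
}
```

Each returned string holds the character codes of one array element; for the sample, every element is a single byte between 0x01 and 0x09.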
    

    Using the ToUnicode map of the font in question

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CMapType 2 def
    1 begincodespacerange
    <00><ff>
    endcodespacerange
    9 beginbfrange
    <01><01><0054>
    <02><02><0045>
    <03><03><0053>
    <04><04><0020>
    <05><05><0050>
    <06><06><0044>
    <07><07><0046>
    <08><08><0049>
    <09><09><004c>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end end 
    

    the byte values from within the brackets map to:

        01    0054    "T"
        02    0045    "E"
        03    0053    "S"
        01    0054    "T"
        04    0020    " "
        05    0050    "P"
        06    0044    "D"
        07    0046    "F"
        04    0020    " "
        07    0046    "F"
        08    0049    "I"
        09    004c    "L"
        02    0045    "E"
        04    0020    " "
    

    Thus,

    "TEST PDF FILE "
    

    which matches the rendered file just fine:

    Screenshot
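The lookup done by hand above can be sketched in C++. `BfRange`, `decode`, and `kSampleRanges` are hypothetical names; the sketch assumes single-byte codes and a single UTF-16 code unit per destination, which holds for this sample but not for ToUnicode CMaps in general, and it emits U+FFFD for unmapped codes as its own convention.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One line of a beginbfrange section: <lo> <hi> <dst>.
struct BfRange {
    unsigned lo;    // first source code of the range
    unsigned hi;    // last source code of the range
    char16_t dst;   // Unicode value for lo; lo+1 maps to dst+1, and so on
};

// Decode a run of single-byte character codes using bfrange entries.
std::u16string decode(const std::vector<BfRange>& ranges, const std::string& codes)
{
    std::u16string out;
    for (unsigned char c : codes) {
        char16_t u = 0xFFFD;                  // used when no range matches
        for (const BfRange& r : ranges) {
            assert(r.lo <= r.hi);
            if (c >= r.lo && c <= r.hi) {
                u = static_cast<char16_t>(r.dst + (c - r.lo));
                break;
            }
        }
        out.push_back(u);
    }
    return out;
}

// The nine ranges from the ToUnicode CMap of the sample font.
const std::vector<BfRange> kSampleRanges = {
    {0x01, 0x01, 0x0054}, {0x02, 0x02, 0x0045}, {0x03, 0x03, 0x0053},
    {0x04, 0x04, 0x0020}, {0x05, 0x05, 0x0050}, {0x06, 0x06, 0x0044},
    {0x07, 0x07, 0x0046}, {0x08, 0x08, 0x0049}, {0x09, 0x09, 0x004c},
};
```

Feeding the fourteen bytes 01 02 03 01 04 05 06 07 04 07 08 09 02 04 into `decode(kSampleRanges, ...)` gives the UTF-16 string "TEST PDF FILE ", the same result as the table above.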

    Thirdly the encoding

    and another question: in /Encoding, what values does Differences contain?

    10 0 obj << /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

    According to the PDF specification,

    The value of the Differences entry shall be an array of character codes and character names organized as follows:

    code_1 name_1,1 name_1,2 ...

    code_2 name_2,1 name_2,2 ...

    code_n name_n,1 name_n,2 ...

    Each code shall be the first index in a sequence of character codes to be changed. The first character name after the code becomes the name corresponding to that code. Subsequent names replace consecutive code indices until the next code appears in the array or the array ends. These sequences may be specified in any order but shall not overlap.

    Thus, the encoding entry in your case says that the encoding basically is WinAnsiEncoding with the difference that the codes 1, ..., 9 instead represent the glyphs named /g100, /g28, /g94, /g3, /g87, /g24, /g38, /g47, and /g62 respectively.
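How a Differences array overlays a base encoding can be sketched in C++ like this. The function name `applyDifferences` and the token representation (array elements as strings, names with a leading slash) are assumptions made for the illustration; the base table passed in stands for whatever code to glyph name table the BaseEncoding provides.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Overlay a Differences array on a code -> glyph-name table.
// Tokens are the array elements as text: "1", "/g100", "/g28", ...
std::map<int, std::string> applyDifferences(std::map<int, std::string> table,
                                            const std::vector<std::string>& tokens)
{
    int code = 0;
    for (const std::string& tok : tokens) {
        if (!tok.empty() && tok[0] == '/') {
            // A name replaces the current code; the next name gets code + 1.
            assert(code >= 0 && code <= 255);
            table[code++] = tok.substr(1);
        } else {
            // An integer starts a new run of consecutive codes.
            code = std::stoi(tok);
        }
    }
    return table;
}
```

Applied to the sample array, codes 1 through 9 end up named g100, g28, g94, g3, g87, g24, g38, g47, and g62, while every other code keeps its name from the base table.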

    As these glyph names are not standard glyph names, the PDF specification does not consider this encoding helpful for text extraction: the Encoding-based method it describes applies only to a simple font

    that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D)

    The "/gXX" names in your sample clearly are not among them.