ios pdf cgpdfscanner

CGPDFScanner - \x15 character while scanning


I am trying to extract the text of page 5 of a PDF.
The PDF has a font, YLJAAA+CMSY10, which has no mapping (CMap) and no encoding either (neither a default encoding nor /Differences).
While extracting text, after the string "tetex package", CGPDFScanner returns the character "\x15", and it shows up many times.
When this character is encountered, the current font is the one mentioned above, which provides nothing for mapping the PDF string to text. What is this \x15 character?

Thanks.


Solution

  • I found 2 (not "many") occurrences of this:

    [ (\025) ] TJ
    

    Here \025 is an octal escape: octal 025 is decimal 21, which is \x15 in hexadecimal, so this is the character you are seeing.

    The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol"):

    114 0 obj
    <<
      /Type         /Font
      /Subtype      /Type1
      /BaseFont     210 0 R % -> "/YLJAAA+CMSY10"
      /FirstChar    0
      /FontDescriptor 211 0 R
      /LastChar     127
      /Widths       204 0 R
    >>
    endobj
    
    211 0 obj
    <<
      /Ascent       750
      /CapHeight    683
      /CharSet      (/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)
      /Descent      0
      /Flags        4
      /FontBBox     [ -29 -960 1116 775 ]
      /FontFile     205 0 R
      /FontName     210 0 R   % -> '/YLJAAA+CMSY10'
      /ItalicAngle  -14
      /StemV        85
      /XHeight      430
    >>
    endobj
    

    In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will, as long as it does the same with the embedded font. Assuming the font is not reordered, looking up a list of CMxx encodings shows that character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).

    Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
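
The lookup described above can be sketched in a few lines of Python. This is only an illustration, not what CGPDFScanner does internally: it assumes the embedded CMSY10 subset keeps the standard TeX math-symbol layout (which, as noted, a PDF producer is not obliged to do), and the names OMS_TO_UNICODE, decode_pdf_literal and map_codes are made up for this example. The table lists just a handful of slots, and only octal escapes are handled.

```python
import re

# Assumption: the embedded font was not reordered and still uses the usual
# TeX math-symbol (CMSY) slots. Only a few slots are listed, for illustration.
OMS_TO_UNICODE = {
    0x0F: "\u2022",  # bullet
    0x14: "\u2264",  # lessequal
    0x15: "\u2265",  # greaterequal
    0x21: "\u2192",  # arrowright
    0x32: "\u2208",  # element
}

_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_pdf_literal(body: str) -> bytes:
    """Resolve octal escapes (backslash + 1-3 octal digits) in the body of a
    PDF literal string. Other escapes are left untouched in this sketch."""
    resolved = _OCTAL_ESCAPE.sub(
        lambda m: chr(int(m.group(1), 8) & 0xFF), body
    )
    return resolved.encode("latin-1")


def map_codes(codes: bytes) -> str:
    """Map character codes to Unicode; unknown codes become U+FFFD."""
    return "".join(OMS_TO_UNICODE.get(code, "\ufffd") for code in codes)


codes = decode_pdf_literal(r"\025")  # the string body from "[ (\025) ] TJ"
print(codes.hex())       # 15
print(map_codes(codes))  # ≥
```

In an iOS extractor the same idea would apply to the bytes handed over by the scanner while this font is current: with no CMap and no /Encoding to consult, a fallback table keyed on the base font name is one possible workaround.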