I am trying to extract the text of page 5 of a PDF.
The PDF has a font, YLJAAA+CMSY10, which has no mapping (CMap) and no encoding either (neither a default encoding nor /Differences).
While extracting text, after the string "tetex package", CGPDFScanner returns a "\x15" character, which is encountered many times.
When this character is encountered, the current font is the one mentioned above, which provides nothing for mapping the PDF string to text.
What is this \x15 character?
Thanks.
I found 2 (not "many") occurrences of this:
[ (\025) ] TJ
where \025 is an octal escape: octal 025 is the same value as \x15 in hexadecimal (21 in decimal).
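To make the octal/hex relationship concrete, here is a minimal Python sketch that decodes the `\ddd` octal escapes of a PDF literal string. The function name `decode_pdf_octal_escapes` is made up for this illustration, and the sketch deliberately ignores the other PDF string escapes (`\n`, `\(`, and so on).

```python
import re

def decode_pdf_octal_escapes(literal: str) -> bytes:
    """Decode \\ddd octal escapes in the body of a PDF literal string.

    Hypothetical helper for illustration only; other escapes are left as-is.
    """
    def replace(match: re.Match) -> str:
        # One to three octal digits; keep only the low 8 bits.
        return chr(int(match.group(1), 8) & 0xFF)

    return re.sub(r"\\([0-7]{1,3})", replace, literal).encode("latin-1")

# The string body from "[ (\025) ] TJ"
code = decode_pdf_octal_escapes(r"\025")[0]
print(code, hex(code), oct(code))  # 21 0x15 0o25
```

So the byte CGPDFScanner hands back as "\x15" is exactly the `\025` written in the content stream.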
The font definition for "YLJAAA+CMSY10" in the PDF carries no special encoding, so it has the default encoding for "CMSY" ("Computer Modern Symbol"):
114 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont 210 0 R % -> "/YLJAAA+CMSY10"
/FirstChar 0
/FontDescriptor 211 0 R
/LastChar 127
/Widths 204 0 R
>>
endobj
211 0 obj
<<
/Ascent 750
/CapHeight 683
/CharSet (/bullet/greaterequal/arrowright/arrowdblright/element/negationslash/backslash/radical)
/Descent 0
/Flags 4
/FontBBox [ -29 -960 1116 775 ]
/FontFile 205 0 R
/FontName 210 0 R % -> '/YLJAAA+CMSY10'
/ItalicAngle -14
/StemV 85
/XHeight 430
>>
endobj
In itself, this still says nothing definitive: a PDF producer may reorder glyphs and encodings at will (as long as it does the same with the embedded font). Assuming the font is not reordered, checking a list of CMxx encodings shows that the character code 0x15 could well be GREATER-THAN OR EQUAL TO (Unicode U+2265).
Acrobat agrees; inspecting the font in the PDF shows that character code 21 (decimal) is named 'GREATER-THAN OR EQUAL' and looks like it as well.
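If you need this in your extractor, one possible fallback is a hand-written lookup table for fonts like this one. The Python sketch below is only that: the code positions are the ones from the standard TeX CMSY layout (an assumption, valid only if the producer did not reorder the font), the table covers just some of the glyphs listed in /CharSet above, and the names `CMSY_SUBSET` and `cmsy_to_unicode` are invented for the example. I left out `negationslash` because it has no obvious single Unicode equivalent.

```python
# Partial table: character code -> (glyph name, Unicode character).
# Positions are ASSUMED from the standard TeX CMSY layout; a producer that
# reorders the embedded font would make this table wrong.
CMSY_SUBSET = {
    0x0F: ("bullet", "\u2022"),
    0x15: ("greaterequal", "\u2265"),
    0x21: ("arrowright", "\u2192"),
    0x29: ("arrowdblright", "\u21D2"),
    0x32: ("element", "\u2208"),
    0x6E: ("backslash", "\\"),
    0x70: ("radical", "\u221A"),
}

def cmsy_to_unicode(data: bytes) -> str:
    """Map raw string bytes shown in a CMSY font to Unicode text.

    Codes missing from the table become U+FFFD (replacement character).
    """
    return "".join(CMSY_SUBSET.get(b, (None, "\uFFFD"))[1] for b in data)

print(cmsy_to_unicode(b"\x15"))  # ≥
```

A more robust approach would read the glyph names out of the embedded font program itself instead of trusting a fixed table, but that is a lot more work.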