[SOLVED] /Differences dictionary for encode parsing issue in PDF

A Type 1 font's /Differences encoding maps character codes to glyph names written as strings; for example, the character 1 is encoded as 'one'. It is used only for numbers and special characters.

What is the standard way to use this encoding?

How should I decode a string from a PDF that uses such an encoding?

Link for the file: http://www.filedropper.com/open

Solution

Here's the /Differences array in your file (and honestly, you should have just posted this and not a link to a skeevy download page):

/Differences [
    24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
    39 /quotesingle
    96 /grave
    128 /bullet/dagger/daggerdbl/ellipsis...
]

The way this works is that the font also has a base encoding associated with it (for example /MacRomanEncoding or /WinAnsiEncoding). In the case of a Type 1 font, there is an encoding built into the font. Given a copy of that encoding, you apply the differences to it: each number in the array sets the starting code, and each name that follows is assigned to the next consecutive code. Starting from the first number (yours is 24), you change entries 24-31 inclusive to /breve, /caron, /circumflex and so on.
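The procedure above can be sketched in Python. This is a minimal illustration, not a full PDF parser: it assumes the /Differences array has already been parsed into a list of ints and name strings, and the function name apply_differences and the all-.notdef base encoding are placeholders invented for the example. Only the entries visible in the array above are used; the run starting at 128 is truncated in the post and is left that way.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding with a parsed /Differences array applied.

    differences is a flat list: an int sets the current code, and each
    name that follows is assigned to consecutive codes from there.
    """
    encoding = list(base_encoding)  # work on a copy, leave the base intact
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any code")
            encoding[code] = item
            code += 1
    return encoding


# Placeholder base encoding (assumption): a real one would come from the
# font itself or from a named encoding such as /WinAnsiEncoding.
base = [".notdef"] * 256

differences = [
    24, "breve", "caron", "circumflex", "dotaccent",
        "hungarumlaut", "ogonek", "ring", "tilde",
    39, "quotesingle",
    96, "grave",
    128, "bullet", "dagger", "daggerdbl", "ellipsis",  # rest elided in the post
]

encoding = apply_differences(base, differences)
print(encoding[24], encoding[26], encoding[31])
```

Copying the base encoding first matters because several fonts in one PDF may share the same base but carry different /Differences arrays.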

In Type 1 fonts, there is a dictionary called /CharStrings, which is an association of the name of a glyph with the data/code that will render it. If, for example, you get a character with code 26, you look it up in your encoding array (which should be a 256-element array for Type 1 fonts) and, with the differences applied, you get the name /circumflex. You then look that up in the CharStrings dictionary, pull out the glyph data and render it. Any character that does not exist in the encoding should be set to /.notdef, which will then render a shape representing an undefined character (usually an empty box).
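As a rough sketch of that lookup chain (code, then glyph name, then glyph data), assuming the encoding is a 256-entry list of names and that /CharStrings has been loaded into a plain dict. The helper names and the byte strings below are made-up stand-ins, not real Type 1 charstring data.

```python
NOTDEF = ".notdef"


def glyph_name_for_code(code, encoding):
    """Map a character code to a glyph name, falling back to .notdef."""
    if 0 <= code < len(encoding):
        return encoding[code] or NOTDEF
    return NOTDEF


def charstring_for_code(code, encoding, charstrings):
    """Return the glyph data for a code; unknown names render as .notdef."""
    name = glyph_name_for_code(code, encoding)
    return charstrings.get(name, charstrings[NOTDEF])


# Encoding after the differences have been applied (only two entries set here).
encoding = [NOTDEF] * 256
encoding[26] = "circumflex"
encoding[39] = "quotesingle"

# Hypothetical CharStrings dictionary with placeholder data.
charstrings = {
    NOTDEF: b"<empty box outline>",
    "circumflex": b"<circumflex outline>",
}

print(glyph_name_for_code(26, encoding))
print(charstring_for_code(39, encoding, charstrings))
```

Code 39 shows the second fallback: the encoding names /quotesingle, but this toy font has no such glyph, so the .notdef data is returned instead.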

Now, likely your real problem is: how do I turn these glyph names into something more useful, like, say, Unicode?

If you look in Annex D of the PDF specification, you'll see a set of tables that define the character sets for the standard Latin encodings. You would make a lookup table that maps Adobe standard glyph names to Unicode. Unfortunately, the tables in Annex D are incomplete. Fortunately, Adobe has a file that defines all of that for you here. There is a link in that file which is now dead, but most likely it was meant to go here.
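A minimal sketch of such a lookup table, covering only the glyph names that appear in this thread. A real decoder would load the complete mapping from Adobe's file rather than hard-coding it. The function name decode_bytes and the use of U+FFFD for unmapped names are choices made for this example.

```python
# Partial glyph-name-to-Unicode table: just the names from this thread.
GLYPH_TO_UNICODE = {
    "one": "\u0031",
    "quotesingle": "\u0027",
    "grave": "\u0060",
    "breve": "\u02d8",
    "caron": "\u02c7",
    "circumflex": "\u02c6",
    "dotaccent": "\u02d9",
    "hungarumlaut": "\u02dd",
    "ogonek": "\u02db",
    "ring": "\u02da",
    "tilde": "\u02dc",
    "bullet": "\u2022",
    "dagger": "\u2020",
    "daggerdbl": "\u2021",
    "ellipsis": "\u2026",
}

REPLACEMENT = "\ufffd"  # used when a glyph name has no known mapping


def decode_bytes(data, encoding, glyph_to_unicode=GLYPH_TO_UNICODE):
    """Decode a single-byte PDF string: code -> glyph name -> Unicode."""
    out = []
    for code in data:  # iterating over bytes yields ints 0..255
        name = encoding[code]
        out.append(glyph_to_unicode.get(name, REPLACEMENT))
    return "".join(out)


# Toy encoding with the differences already applied (assumption).
encoding = [".notdef"] * 256
encoding[26] = "circumflex"
encoding[49] = "one"
encoding[128] = "bullet"

text = decode_bytes(b"\x80\x31\x1a", encoding)
print([hex(ord(ch)) for ch in text])
```

Returning U+FFFD for unknown names keeps the output the same length as the input, which makes it easier to spot which codes still need a mapping.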