Strange encoding of a pdf stream


I'm studying the internal structure of PDF files, so I created a document in LibreOffice Writer containing only the string "Hello world" and exported it to PDF. I then uncompressed it with:

    pdftk hello_world.pdf output hello_world_unc.pdf uncompress

and opened the result in a text editor.

Analyzing the content stream, I found something strange like this: [<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]TJ, which should represent "Hello world" as an array of hexadecimal strings (in angle brackets) and integers that adjust the spacing.

Note that the file contains only this string; I created it purely for educational purposes.

The problem is that these don't look like the hexadecimal character codes I expected. Surely "H" is not encoded as 01 (in ASCII it would be 48). I was expecting something like this: (Hello world) Tj.

Can anyone help me understand? Thanks in advance


Solution

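  • These numbers are not character codes in any standard encoding; they are just indexes into the font's character map. Judging by your stream, the codes were assigned in order of first use: "H" is 01, "e" is 02, "l" is 03, and so on.

    Look further into the uncompressed PDF and you will find lines like these, each mapping a code to a Unicode value (entries of this form typically sit in the font's ToUnicode CMap, between beginbfchar and endbfchar):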

    <01> <0048>
    <02> <0065>
    <03> <006C>
    <04> <006F>
    <05> <0020>
    <06> <0077>
    <07> <0072>
    <08> <0064>
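
To check the mapping by hand, here is a minimal Python sketch that applies the table above to the TJ array from the question. It is not a PDF parser: the names parse_bfchar and decode_tj are made up for this illustration, and it assumes one-byte codes (as in this file) and simply ignores the spacing numbers between the strings.

```python
import re

# Mapping lines copied from the answer above: <code> <UTF-16BE value>
BFCHAR = """
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
<05> <0020>
<06> <0077>
<07> <0072>
<08> <0064>
"""

# Operand of the TJ operator, copied from the question
TJ_ARRAY = "[<01>5<02>-6<03>2<03>2<040506>-2 <040703>2<08>]"


def parse_bfchar(text):
    """Build a {code: character} dict from '<src> <dst>' pairs."""
    cmap = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", text):
        cmap[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return cmap


def decode_tj(array, cmap, code_bytes=1):
    """Decode the hex strings of a TJ array, skipping the spacing numbers.

    code_bytes=1 is an assumption that holds for this file; fonts with
    two-byte codes would need code_bytes=2.
    """
    chars = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", array):
        data = bytes.fromhex(hex_string)
        for i in range(0, len(data), code_bytes):
            code = int.from_bytes(data[i:i + code_bytes], "big")
            # Unknown codes become U+FFFD instead of raising an error
            chars.append(cmap.get(code, "\ufffd"))
    return "".join(chars)


if __name__ == "__main__":
    print(decode_tj(TJ_ARRAY, parse_bfchar(BFCHAR)))
```

Running it prints Hello world, which confirms that <040506> is "o", space, "w" and <040703> is "o", "r", "l".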