pdf qpdf

How to reference text content in preformatted PDF (QDF)?


When editing the source of a PDF file with a text editor, I can use an object more than once by referencing it multiple times.

This example is a reference to object 15:

  /Resources <<
    /XObject <<
      /Fm0 15 0 R
    >>
  >>

If object 15 is text, this text will appear in a PDF viewer in every place it is referenced. But it will always be in the font and size defined under /Resources and in the stream within object 15:

15 0 obj
<<
  /BBox [
    3.24609
    767.215
    507.739
    819.297
  ]
  /FormType 1
  /Resources <<
    /Font <<
      /F1 25 0 R
    >>
    /ProcSet [
      /PDF
      /Text
      /ImageB
      /ImageC
      /ImageI
    ]
  >>
  /Subtype /Form
  /Type /XObject
  /Length 16 0 R
>>
stream
q
0 g
BT
0 Tr
/F1 25 Tf
1 0 0 1 36 785.248 Tm
[(0123)] TJ
ET
Q
endstream
endobj

What I actually need is a string of characters (four digits, in fact) that is referred to in two or more places in the source of the PDF. However, the font and size differ each time, while the encoding of both fonts is the same (that is, after converting to QDF format, the characters used are readable as plain text in the text editor – as long as they’re within the ASCII range).

So I guess what I’m looking for is two things:

  1. The correct way to add a text string to the PDF file so I can
  2. reference it from within different streams.

––> Is there a way to do this?

[The requirement is that, once ready, the four digits can be replaced by four different ones on any system if I put a comment above the line to be amended so they can easily find the right spot. Without having to install software or fonts (which are already embedded in the preformatted PDF) first, but just by using a text editor.]


Solution

  • Editing text in text editors is not a use case PDF was designed for. Thus, please don't be too surprised to hear that a generic solution for your requirement is not to be expected.

    There are certain special situations in which you may implement a solution. Nonetheless, the requirement as a whole is a bad idea.

    A Bad Requirement

    "The requirement is that, once ready, the four digits can be replaced by four different ones on any system if I put a comment above the line to be amended so they can easily find the right spot. Without having to install software or fonts (which are already embedded in the preformatted PDF) first, but just by using a text editor."

    This is a bad requirement. Editing a PDF with a hex editor is already delicate, but there you can at least be sure that you don't inadvertently change content you did not want to change. Many text editors are different in that regard: they silently apply changes (for example to line endings or the character encoding) that don't matter for actual text documents but do matter for PDFs.

    Postprocessing the edited PDFs with QPDF again mitigates this somewhat, but errors will still happen with this approach.

    Form XObjects

    Form XObjects are not really a solution for your problem as they do not represent isolated strings but complete, fully styled content pieces. They can be transformed to be resized or rotated, differently for every use, but they cannot be re-styled.

    (Well, you could try the obsolete option of using a Form XObject without a Resources entry; such an object inherits the resources of the page it is displayed on. Different pages may have a different font associated in their resources with the font name used in your XObject. This way, you could at least have that number in different styles on different pages. But as mentioned above, this construct is obsolete and, therefore, should not be used.)

    Page Content Streams

    If the text-showing instructions for the four-digit number in question need only occur at page level (i.e. not in some XObject, Pattern, etc.), you can take advantage of the fact that the page Contents may be an array of streams, which a viewer processes as if they were concatenated into one.

    For example:

    %PDF-1.7
    %äöü
    1 0 obj
    <<
    /Length 9
    >>
    stream
    (1234) Tj
    endstream
    endobj 
    [... more indirect objects ...]
    100 0 obj
    <<
    /Type /Page
    /Contents [101 0 R 1 0 R 102 0 R 1 0 R 103 0 R]
    [... more page entries ... ]
    >>
    endobj
    101 0 obj
    <<
    /Length XXX
    >>
    stream
    [... some page specific drawing instructions ...]
    BT
    1 0 0 1 30 600 Tm
    /Font1 10 Tf
    endstream
    endobj 
    102 0 obj
    <<
    /Length XXX
    >>
    stream
    ET
    [... some page specific drawing instructions ...]
    BT
    0 1 -1 0 300 300 Tm
    /Font2 20 Tf
    endstream
    endobj 
    103 0 obj
    <<
    /Length XXX
    >>
    stream
    ET
    [... some page specific drawing instructions ...]
    endstream
    endobj 
    [... more indirect objects and the whole end-of-file stuff ...]
    

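    Here the content of page object 100 is spread over multiple streams. Stream 1 0 R, which contains only the text-showing instruction, appears twice in the Contents array, each time directly after a stream (101 0 R and 102 0 R) that ends inside an open text object with its own font, size, and text matrix already set. The same four digits are therefore drawn twice, in different styles, from a single place in the file.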
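
    To tie this back to the "comment above the line" idea from the question: a PDF comment starts with % and runs to the end of the line, so a marker could be placed directly before the shared object. The following is only an illustrative sketch. The wording of the comment is made up, and it assumes the replacement is always exactly four ASCII digits, so that /Length 9 and all byte offsets stay valid.

    ```pdf
    % EDIT HERE: replace 1234 below with four other digits (exactly four characters)
    1 0 obj
    <<
    /Length 9
    >>
    stream
    (1234) Tj
    endstream
    endobj
    ```

    Adding the comment line itself shifts the byte offsets of everything after it, so the cross-reference table has to be rebuilt once after inserting it (for files produced with qpdf --qdf, the fix-qdf tool that ships with QPDF is meant for this kind of repair). Later same-length edits of the digits do not move anything.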

    AcroForm Text Fields

    Another option would be to use AcroForm text fields.

    Such text fields have a single value but multiple visualizations (widget annotations), and different visualizations may have different DA default appearance values from which PDF viewers may construct the appearance.

    This doesn't work for all PDF viewers, though.

    Furthermore, in PDF 2.0 you are expected to provide actual appearance streams for all widgets.
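
    As a rough sketch of the text field approach: one field holds the value once, and two widget annotations display it with different DA strings. The object numbers 200 to 202, 300 and 301, the field name, and the rectangles are hypothetical; page object 100 and the font names /Font1 and /Font2 are reused from the page content example above.

    ```pdf
    % Assumption: entry added to the document catalog
    /AcroForm <<
      /Fields [ 200 0 R ]
      /DR << /Font << /Font1 300 0 R /Font2 301 0 R >> >>
      /NeedAppearances true   % asks the viewer to build appearances; deprecated in PDF 2.0
    >>

    % The field itself: the four digits live only here, in /V
    200 0 obj
    <<
      /FT /Tx
      /T (number)
      /V (1234)
      /Kids [ 201 0 R 202 0 R ]
    >>
    endobj

    % First widget: small font
    201 0 obj
    <<
      /Type /Annot
      /Subtype /Widget
      /Parent 200 0 R
      /P 100 0 R
      /Rect [ 30 590 130 612 ]
      /F 4
      /DA (/Font1 10 Tf 0 g)
    >>
    endobj

    % Second widget: larger font
    202 0 obj
    <<
      /Type /Annot
      /Subtype /Widget
      /Parent 200 0 R
      /P 100 0 R
      /Rect [ 300 300 420 330 ]
      /F 4
      /DA (/Font2 20 Tf 0 g)
    >>
    endobj

    % Assumption: entry added to page object 100
    /Annots [ 201 0 R 202 0 R ]
    ```

    With a layout like this, only the /V string would need editing. Whether both widgets then show the new value depends on the viewer regenerating the appearances from DA, which is exactly the viewer dependence mentioned above.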