pdf pdfbox

In a PDF document, how can I "split" a TJ operator into multiple Tj/Td that are completely equivalent?


A related question, "TJ and Td offset difference", has an answer that explains how to do the opposite (turn Tj and Td operators into an equivalent TJ). From this I can deduce fairly trivially how to go the other way round, but the naive approach has a problem: other, seemingly unrelated bits of text also get shifted when I do this, in a manner I don't exactly understand.

A caveat in the answer to the question linked above points out that "even then the two forms are not identical because the text line matrix at the end differs", which is probably the cause of these seemingly random jumps in seemingly unrelated places.

I'm a PDF noob and not entirely sure what exactly a "text line" (or text line matrix) is.

I'm using PDFBox, in particular PdfContentStreamEditor, so an answer using that toolset would be especially great, but a good general "how to" would certainly get me a long way towards it.

Thanks in advance.


Solution

  • As John Whitington pointed out in a comment, the details of PDF text showing are explained in the PDF specification, ISO 32000-2, the current version of which can be retrieved for free from the PDF Association at https://pdfa.org/resource/iso-32000-pdf/

    The relevant details:

    Concerning your question

    how can I "split" a TJ operator into multiple Tj/Td that are completely equivalent?

    you therefore cannot, in general, faithfully replace a TJ with an equivalent sequence of Tj and Td instructions: the former changes only the text matrix (so afterwards the text line matrix has not moved), while the latter changes both the text line matrix and the text matrix. And you cannot set the text line matrix back to its former value without also changing the text matrix. (The text line matrix records where the current line of text started; Td, TD, T*, ' and " all position the next text relative to it.)
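To make the difference concrete, here is a minimal Java sketch that tracks only the horizontal translation of the two matrices. It is not PDFBox code; the class name, the font size of 10 and the uniform glyph advance of 6 units are assumptions chosen purely for illustration, and character spacing, word spacing and horizontal scaling are ignored.

```java
import java.util.Locale;

public class TextStateDemo {
    static final double FONT_SIZE = 10; // assumed font size
    static final double GLYPH = 6;      // assumed advance of every glyph, in text space units

    double tmX;   // x translation of the text matrix
    double tlmX;  // x translation of the text line matrix

    /** Td: move the text line matrix, then copy it into the text matrix. */
    void td(double tx) { tlmX += tx; tmX = tlmX; }

    /** Showing glyphs (Tj, or a string inside TJ) advances only the text matrix. */
    void show(double advance) { tmX += advance; }

    /** A number n inside a TJ array shifts only the text matrix, by -n/1000 * font size. */
    void adjust(double n) { tmX -= n / 1000.0 * FONT_SIZE; }

    /** Models "30 0 Td [(A) 34 (B)] TJ"; returns {tmX, tlmX}. */
    static double[] withTJ() {
        TextStateDemo s = new TextStateDemo();
        s.td(30); s.show(GLYPH); s.adjust(34); s.show(GLYPH);
        return new double[] { s.tmX, s.tlmX };
    }

    /** Models "30 0 Td (A) Tj 5.66 0 Td (B) Tj"; returns {tmX, tlmX}. */
    static double[] withTjTd() {
        TextStateDemo s = new TextStateDemo();
        s.td(30); s.show(GLYPH); s.td(GLYPH - 34 / 1000.0 * FONT_SIZE); s.show(GLYPH);
        return new double[] { s.tmX, s.tlmX };
    }

    public static void main(String[] args) {
        double[] a = withTJ(), b = withTjTd();
        System.out.printf(Locale.ROOT, "TJ:    tm=%.2f tlm=%.2f%n", a[0], a[1]);
        System.out.printf(Locale.ROOT, "Tj/Td: tm=%.2f tlm=%.2f%n", b[0], b[1]);
    }
}
```

Under these assumptions both variants leave the text matrix at 41.66, so the glyphs land in the same places, but the text line matrix ends at 30.00 in the TJ variant and at 35.66 in the Tj/Td variant. Any later Td, TD, T*, ' or " therefore starts from a different origin, which would explain shifts in seemingly unrelated text.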

    What you can do, though, is remember the most recent value of the text line matrix. After you have replaced a TJ with Tj and Td instructions, you continue processing the following instructions (possibly replacing further TJs the same way), and as soon as you encounter an instruction that uses the text line matrix, you insert right before it a Tm that sets the text line matrix to the remembered value (alternatively, undo the effects of your replacement Tds with a single Td carrying the negative sum of all your changes). If you encounter a Tm, though, you can stop, as Tm sets both the text matrix and the text line matrix absolutely.

    For example, suppose you have a text object like this:

    BT
    30 600 Td
    [(A) 34 (B) -567 (C)] TJ
    0 -15 Td
    [(D) -1500 (E)] TJ
    (FGH) Tj
    0 -15 TD
    [(I) 17 (J)] TJ
    (K) '
    ET
    

    You can transform it like this:
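As a rough sketch of the procedure (using the Td-undo variant rather than Tm), the following Java class rewrites a stream of text operators. Everything in it is an assumption for illustration and not part of PDFBox: the class and method names, the font size of 10, and a uniform glyph width of 500/1000 (real code would have to look up each glyph's width in the current font). It treats the `[(I) 17 (J)]` array as a TJ operand, ignores character spacing, word spacing, horizontal scaling, T*, " and Tm, and does not escape string contents.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;

public class TjSplitter {
    private final double fontSize;    // assumed current font size
    private final double glyphWidth;  // assumed uniform glyph width, in 1/1000 units
    private final List<String> out = new ArrayList<>();
    private double drift;   // how far inserted Tds have moved the text line matrix
    private double cursor;  // text matrix x minus text line matrix x, in text space

    public TjSplitter(double fontSize, double glyphWidth) {
        this.fontSize = fontSize;
        this.glyphWidth = glyphWidth;
    }

    public List<String> result() { return out; }
    public void beginText() { drift = 0; cursor = 0; out.add("BT"); }
    public void endText() { out.add("ET"); }

    /** Td or TD: fold the accumulated drift into the original offset. */
    public void move(String op, double tx, double ty) {
        out.add(num(tx - drift) + " " + num(ty) + " " + op);
        drift = 0;
        cursor = 0;
    }

    /** Tj passes through unchanged; only the text matrix advances. */
    public void show(String text) {
        out.add("(" + text + ") Tj");
        cursor += text.length() * glyphWidth / 1000.0 * fontSize;
    }

    /** ' starts from the text line matrix, so undo the drift before it. */
    public void nextLineShow(String text) {
        if (drift != 0) { out.add(num(-drift) + " 0 Td"); drift = 0; }
        out.add("(" + text + ") '");
        cursor = text.length() * glyphWidth / 1000.0 * fontSize;
    }

    /** TJ: strings become Tj, numbers become Td offsets from the line start. */
    public void showAdjusted(Object... elements) {
        double pending = 0;
        for (Object e : elements) {
            if (e instanceof Number) {
                pending -= ((Number) e).doubleValue() / 1000.0 * fontSize;
            } else {
                if (pending != 0) { shift(pending); pending = 0; }
                show((String) e);
            }
        }
        if (pending != 0) shift(pending);
    }

    /** Td is relative to the line start, so add the distance already travelled. */
    private void shift(double adjust) {
        double tx = cursor + adjust;
        out.add(num(tx) + " 0 Td");
        drift += tx;
        cursor = 0;
    }

    /** Formats with at most three decimals and no trailing zeros. */
    private static String num(double v) {
        return new BigDecimal(v).setScale(3, RoundingMode.HALF_UP)
                .stripTrailingZeros().toPlainString();
    }

    /** Replays the example text object from above. */
    public static List<String> demo() {
        TjSplitter s = new TjSplitter(10, 500);
        s.beginText();
        s.move("Td", 30, 600);
        s.showAdjusted("A", 34, "B", -567, "C");
        s.move("Td", 0, -15);
        s.showAdjusted("D", -1500, "E");
        s.show("FGH");
        s.move("TD", 0, -15);
        s.showAdjusted("I", 17, "J");
        s.nextLineShow("K");
        s.endText();
        return s.result();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

With the assumed font size and glyph width, the sketch prints:

    BT
    30 600 Td
    (A) Tj
    4.66 0 Td
    (B) Tj
    10.67 0 Td
    (C) Tj
    -15.33 -15 Td
    (D) Tj
    20 0 Td
    (E) Tj
    (FGH) Tj
    -20 -15 TD
    (I) Tj
    4.83 0 Td
    (J) Tj
    -4.83 0 Td
    (K) '
    ET

Note how the original `0 -15 Td` and `0 -15 TD` absorb the negative sum of the inserted offsets, while the `'` operator, which takes no offset operands, needs an extra Td in front of it. The `(FGH) Tj` needs no correction because it depends only on the text matrix.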


    An alternative would be to use character spacing: draw the last glyph of each string separately; before it, set the character spacing to include all the extra space, and after it, reset the character spacing to its former value.
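A minimal Java sketch of this character-spacing variant follows. The class and method names are hypothetical, and it assumes single-byte strings (one character per glyph), that the current font size and character spacing are known, and that every adjustment follows a non-empty string. Because Tc never touches the text line matrix, no later correction is needed.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;

public class CharSpacingSplit {
    /**
     * Rewrites the elements of one TJ array (String = text, Number = adjustment)
     * into Tj and Tc operators. tc is the character spacing in effect before the TJ.
     */
    static List<String> split(double fontSize, double tc, Object... elements) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < elements.length; i++) {
            if (!(elements[i] instanceof String) || ((String) elements[i]).isEmpty()) {
                throw new IllegalArgumentException("adjustment without a preceding glyph");
            }
            String s = (String) elements[i];
            // Sum up all numbers that directly follow this string.
            double adjust = 0;
            while (i + 1 < elements.length && elements[i + 1] instanceof Number) {
                adjust += ((Number) elements[++i]).doubleValue();
            }
            if (adjust == 0) {
                out.add("(" + s + ") Tj");
                continue;
            }
            // Show everything but the last glyph with the normal spacing.
            if (s.length() > 1) {
                out.add("(" + s.substring(0, s.length() - 1) + ") Tj");
            }
            // TJ numbers are subtracted, in thousandths scaled by the font size.
            out.add(num(tc - adjust / 1000.0 * fontSize) + " Tc");
            out.add("(" + s.substring(s.length() - 1) + ") Tj");
            out.add(num(tc) + " Tc");
        }
        return out;
    }

    /** Formats with at most three decimals and no trailing zeros. */
    private static String num(double v) {
        return new BigDecimal(v).setScale(3, RoundingMode.HALF_UP)
                .stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        split(10, 0, "AB", 34, "C").forEach(System.out::println);
    }
}
```

For `[(AB) 34 (C)] TJ` at font size 10 with no prior character spacing, this yields `(A) Tj`, `-0.34 Tc`, `(B) Tj`, `0 Tc`, `(C) Tj`.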