pdf accessibility pdfbox tagging pdf-manipulation

Calculating the exact positions set by the Td, TD, Tm, cm, and T* operators in a PDF content stream?



As a human, I can work out the positions set by the operators in a PDF content stream (whether a Td replaces the last Td, is added to the last Td, or is multiplied by the font size) by comparing where the glyphs are located on the page with the position values in the content stream. But I am unable to calculate the exact glyph positions programmatically. Please see the screenshot.

In the image above, the left box shows the glyphs as rendered in the PDF viewer and the right box contains the related content stream. In the content stream I highlighted two Td positions.

In the first circle:

3.321 -6.475999832 Td

The Td operands should be added to the last Td position; call that position x1, y1.

Current_x_pos = x1+3.321

Current_y_pos = y1-6.475999832

Then we get the exact position of the glyph "t".

In the second highlighted circle, the new Td operands (231.544 366.377990 Td) completely replace the previous position:

Current_x_pos = 231.544

Current_y_pos = 366.377990

In addition, sometimes the preceding operator is Tm; in that case the formula seems to be:

Current_x_pos = x1+(tdx1*font_size)

Current_y_pos = y1+(tdy1*font_size)

Sometimes we need to multiply as above, and sometimes to add. How can I know programmatically which one applies, so that I can parse the exact positions? (A new screenshot has been added for the multiplication case.)

Any help? Thanks.


Solution

  • Sometimes we need to multiply as above, and sometimes to add. How can I know programmatically which one applies, so that I can parse the exact positions?

    It's quite simple: for a Td operation you always multiply. See the specification, ISO 32000-1 (and similarly ISO 32000-2):

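    excerpt from ISO 32000-1 (image not reproduced here): in the table of text-positioning operators, the entry for tx ty Td says the operator sets Tm = Tlm = T × Tlm, where T is the translation matrix with rows 1 0 0, 0 1 0, tx ty 1.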

    For a freshly initialized (i.e. identity) text line matrix Tlm this matrix multiplication looks like replacing its bottom row with tx ty 1.

    For a text line matrix Tlm that differs from the identity only in its bottom row, this matrix multiplication looks like an addition to the bottom row, e.g. x y 1 becomes x+tx y+ty 1.

    For a text line matrix Tlm like in your second example

    a 0 0
    0 a 0
    x y 1
    

    this matrix multiplication looks like a multiplication by a followed by an addition to the bottom row, i.e. x y 1 becomes x+a·tx y+a·ty 1. If the font size parameter of the preceding Tf operation was 1, then a would effectively be the resulting font size, which gives rise to your assumption that the font size is part of the formula.

    In general, for an arbitrary, non-degenerate text line matrix Tlm

    a b 0
    c d 0
    x y 1
    

    this matrix multiplication looks even more complex: x y 1 becomes x+a·tx+c·ty y+b·tx+d·ty 1.

    Thus, concerning your question

    How can I know programmatically which one applies, so that I can parse the exact positions?

    your program should simply always use matrix multiplication and ignore what it looks like on the level of the separate coordinates.
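To make that concrete, here is a minimal sketch in Python, with plain nested lists standing in for 3×3 matrices. The helper names `matmul` and `td` are made up for this illustration and are not part of PDFBox or any other library. The starting position 100 200 and the scale factor a = 12 are example values, not taken from your file. The same function covers all three cases above; only the input Tlm differs.

```python
# Row-vector convention, as in the PDF specification: points are [x y 1].
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(m, n):
    """Return the 3x3 matrix product m x n."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def td(tlm, tx, ty):
    """Apply `tx ty Td`: new Tlm = [1 0 0; 0 1 0; tx ty 1] x Tlm."""
    translation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [tx, ty, 1.0]]
    return matmul(translation, tlm)

# Case 1: identity Tlm (fresh after BT), so the result looks like a replacement.
replaced = td(IDENTITY, 231.544, 366.37799)

# Case 2: Tlm is a pure translation (assumed 100 200), so it looks like an addition.
translated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [100.0, 200.0, 1.0]]
added = td(translated, 3.321, -6.475999832)

# Case 3: Tlm has a uniform scale (assumed a = 12), so it looks like
# "multiply by a, then add".
scaled_tlm = [[12.0, 0.0, 0.0], [0.0, 12.0, 0.0], [100.0, 200.0, 1.0]]
scaled = td(scaled_tlm, 1.5, -2.0)

for name, matrix in (("replaced", replaced), ("added", added), ("scaled", scaled)):
    print(name, matrix[2][:2])  # bottom row holds the new line origin
```

The bottom row of each result is the new line origin in text space. The top-left 2×2 part of Tlm is carried through unchanged by Td.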


    What makes the second circled instruction look like a mere replacement is that the prior text line matrix is the identity matrix. This is not due to the restore-state operation, as assumed by François, but more simply to the start-of-text-object operation BT:

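    excerpt from ISO 32000-1 (image not reproduced here): BT begins a text object and initializes the text matrix Tm and the text line matrix Tlm to the identity matrix.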

    As the text matrix and the text line matrix are reset at the start of a text object and the graphics state cannot be saved or restored in a text object, the save and restore graphics state operations are not to blame in this case.
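Putting the reset at BT together with the always-multiply rule, a small tracker for the text line matrix could look like the following Python sketch. It assumes the content stream has already been tokenized into (operator, operands) pairs; that input format and the function name `track_line_matrices` are hypothetical, and the operand values in the usage part are examples only. It relies on the specification's definitions that TD also sets the leading to -ty and that T* behaves like `0 -Tl Td` with the current leading Tl.

```python
def matmul(m, n):
    """Return the 3x3 matrix product m x n (row-vector convention)."""
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def identity():
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def track_line_matrices(ops):
    """Return a list of (operator, Tlm) after each BT, Td, TD, T* and Tm.

    `ops` is an assumed, pre-tokenized list of (operator, operands) pairs.
    """
    tlm = identity()
    leading = 0.0  # text state parameter Tl; not reset by BT
    results = []
    for op, args in ops:
        if op == "BT":
            tlm = identity()  # text object start resets Tm and Tlm
        elif op == "TL":
            leading = args[0]
            continue
        elif op in ("Td", "TD", "T*"):
            if op == "T*":
                tx, ty = 0.0, -leading
            else:
                tx, ty = args
                if op == "TD":
                    leading = -ty  # TD is Td plus setting the leading
            tlm = matmul([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [tx, ty, 1.0]], tlm)
        elif op == "Tm":
            a, b, c, d, e, f = args
            tlm = [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]  # Tm replaces, no multiply
        else:
            continue  # everything else is ignored in this sketch
        results.append((op, tlm))
    return results

# Usage with example values only.
ops = [
    ("BT", ()), ("Td", (231.544, 366.37799)), ("Td", (3.321, -6.475999832)), ("ET", ()),
    ("BT", ()), ("Tm", (12.0, 0.0, 0.0, 12.0, 100.0, 200.0)),
    ("TD", (1.5, -2.0)), ("T*", ()), ("ET", ()),
]
results = track_line_matrices(ops)
for op, matrix in results:
    print(op, matrix[2][:2])
```

This deliberately leaves out several things a real extractor needs: the `'` and `"` operators (which also move to the next line), the advance of Tm caused by text-showing operators, and the current transformation matrix set by cm together with q/Q, which has to be multiplied in to get from text space to user space.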

    (Screen shots are from the ISO 32000-1 copy shared by Adobe.)