
Why am I getting two different sets of coordinates from parsing a pdf file?


I am trying to parse a PDF file (30000x2000 points) in Python. The file contains all kinds of data: tables, lines, text, notes, images, etc. The goal is to find a certain text string in the PDF and return the note that is in proximity to that text. I am using PyPDF2 to find all the notes and their coordinates, and fitz (PyMuPDF) to find text strings and their coordinates.

Using fitz, I searched for 'A715_X1'.

import fitz  # PyMuPDF

doc = fitz.open(path_pdf)
for page in doc:
    # search_for returns a list of rectangles, one per hit on the page
    coordinates_of_item_found_on_print = page.search_for('A715_X1')

Result: X coordinate: 9076 points

Then, using PyPDF2, I searched for the 'Some text here' note.

from PyPDF2 import PdfReader

markup_loc = []  # X coordinates of the matching notes
reader = PdfReader("sample.pdf")
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            markup_coordinates = obj['/Rect']
            if obj['/Subtype'] == '/FreeText': # skip Stamp, Popup, Square, PolyLine
                if obj['/Contents'] == 'Some text here':
                    try:
                        markup_loc.append(str(round(markup_coordinates[0])))
                        print('Note: ' + obj['/Contents'] + ' X-coord: ' + str(round(markup_coordinates[0])))
                    except Exception as e:
                        print(e)

Result: Note: Some text here X-coord: 4280

I attached a screenshot from PDFXchange showing 'A715_X1' right below the 'Some text here' note, so the X coordinates should be roughly the same, give or take a couple of points. Judging by the ruler lines on the print, 9076 seems like the "real" value. So why am I getting two different coordinates?



Solution

  • Without your minimal sample I had to emulate a blank file. Luckily, your image gives enough information to work out the positions and orientation by trial and error.

    So here is that emulation. We can see the /MediaBox size is identical. That horizontal dimension is outside the allowed size for a traditional page, and thus Acrobat kept complaining at my earlier attempts.

    I went through a dozen attempts to get it closest to yours. Clearly my annotation will be slightly different without your existing transformation but I am within a few points.


    So what did I know, and what did I learn, in the attempt to get the correct differentials?

    1. A page traditionally cannot be wider than 14,400 points. To emulate a larger space, the media has to be scaled up, so in this case the maximal 14400 (paper space) is given a UserUnit of 2.12.

    Emulated values should be close to OP units.

    12 0 obj
    <</Annots[14 0 R]/Contents 15 0 R/MediaBox[0 0 14400 1124.15]/Parent 6 0 R/Resources<</ExtGState<</GS0 16 0 R>>>>/Type/Page/UserUnit 2.12>>
    endobj
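
As a rough check of the page object above, the effective page size is the /MediaBox multiplied by /UserUnit. A minimal sketch in plain Python; `effective_page_size` is a hypothetical helper name, and the values are copied from the emulated object rather than from the OP's file:

```python
def effective_page_size(media_box, user_unit=1.0):
    """Return (width, height) of the page in default 1/72 inch points.

    media_box is [llx, lly, urx, ury] in user-space units;
    user_unit is the page's /UserUnit (1.0 when the key is absent).
    """
    llx, lly, urx, ury = media_box
    return (urx - llx) * user_unit, (ury - lly) * user_unit


# Values from the emulated page object (a stand-in for the OP's file).
width, height = effective_page_size([0, 0, 14400, 1124.15], user_unit=2.12)
print(round(width), round(height))  # 30528 2383
```

That is in the same range as the roughly 30000x2000 points quoted in the question.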
    

    Now we can see that if we multiply the annotation's true placement (4281 pt, unscaled) by that scalar we get 4281 x 2.12 = 9,075.72, which is the nominal X offset of its top left-hand corner in the scaled /MediaBox.
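
The same multiplication can be applied to the /Rect that PyPDF2 returns, so that both libraries report comparable numbers. This is only a sketch, under the assumption that the text search already reports coordinates scaled by /UserUnit (as the 9076 in the question suggests). `scale_rect` is a hypothetical helper, and the Y values in the sample rectangle are made up for illustration:

```python
def scale_rect(rect, user_unit=1.0):
    """Scale an annotation /Rect [x0, y0, x1, y1] by the page's /UserUnit."""
    # float() also copes with PDF number objects that are not plain floats
    return [float(v) * user_unit for v in rect]


# With PyPDF2 the inputs would come from the page and annotation dictionaries,
# e.g. user_unit = float(page.get("/UserUnit", 1.0)) and rect = obj["/Rect"].
note_rect = [4281, 900, 4400, 950]  # only x0 = 4281 is taken from the answer
scaled = scale_rect(note_rect, user_unit=2.12)
print(round(scaled[0]))  # 9076, matching the X coordinate from the text search
```

When /UserUnit is absent the default of 1.0 leaves the rectangle unchanged, so the helper should be harmless on ordinary pages.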

    Annotations (including form fields) are, in effect, not tied in any way to the scale or the datum of the page contents. They can have their own coordinate scalar values, so scaling, rotating or moving the page content will often break the relative placements. Luckily, in this case they are using a known historic system.
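
Because of that, it seems safer to bring annotation coordinates into the page's scaled space before comparing them with text positions. Below is a minimal sketch of the proximity match the question is after, comparing X only. `closest_note`, the tolerance and the sample notes are assumptions for illustration, not part of either library:

```python
def closest_note(text_x, notes, user_unit=1.0, tolerance=5.0):
    """Return the contents of the note whose scaled X is nearest to text_x.

    notes is a list of (contents, unscaled_x) pairs; returns None when
    no note lies within `tolerance` scaled points of text_x.
    """
    best = None
    best_dist = tolerance
    for contents, x in notes:
        dist = abs(float(x) * user_unit - text_x)
        if dist <= best_dist:
            best, best_dist = contents, dist
    return best


# Hypothetical notes; only the first X value comes from the answer.
notes = [("Some text here", 4281), ("Another note", 1200)]
print(closest_note(9076, notes, user_unit=2.12))  # Some text here
```

Without the scale factor (user_unit left at 1.0) the same call finds no note within tolerance, which is the mismatch described in the question.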