So, I am trying to parse a PDF file (30000x2000 points) using Python. The file has all kinds of data on it: tables, lines, text, notes, images, etc. The goal: find a certain text string in the PDF and return the note that is in proximity to that text. I am using PyPDF2 to find all the notes and their coordinates. To find text strings and their coordinates I am using fitz.
Using fitz, I searched for 'A715_X1'.
import fitz  # PyMuPDF

doc = fitz.open(path_pdf)
for page in doc:
    # search_for returns a list of rectangles, one per hit on the page
    coordinates_of_item_found_on_print = page.search_for('A715_X1')
Result: X coordinate: 9076 points
Then, using PyPDF2, I searched for the 'Some text here' note.
from PyPDF2 import PdfReader

reader = PdfReader('sample.pdf')
markup_loc = []  # X coordinates of the matching notes
for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            markup_coordinates = obj['/Rect']  # [x0, y0, x1, y1]
            if obj['/Subtype'] == '/FreeText':  # skip Stamp, Popup, Square, PolyLine
                if obj['/Contents'] == 'Some text here':
                    try:
                        markup_loc.append(str(round(markup_coordinates[0])))
                        print('Note: ' + obj['/Contents'] + ' X-coord: ' + str(round(markup_coordinates[0])))
                    except Exception as e:
                        print(e)
Result: Note: Some text here X-coord: 4280
I attached a screenshot from PDFXchange showing 'A715_X1' right below the 'Some text' note, so the X coordinates should be roughly the same, give or take a couple of points. Looking at the print, 9076 seems like the "real" value according to the ruler lines. So why am I getting two different coordinates?
Example pdf:
Without your minimal sample I had to emulate a blank file. Luckily, your image gives enough information to work out positions and orientation by trial and error.
So here is that emulation. We can see the /MediaBox size is identical. That horizontal dimension is outside the allowed size for a traditional page, and thus Acrobat kept complaining at my earlier attempts.
I went through a dozen attempts to get it closest to yours. Clearly my annotation will be slightly different without your existing transformation but I am within a few points.
So what did I know, and what did I learn, in the attempt to get the correct differentials?
Emulated values should be close to OP units.
12 0 obj
<</Annots[14 0 R]/Contents 15 0 R/MediaBox[0 0 14400 1124.15]/Parent 6 0 R/Resources<</ExtGState<</GS0 16 0 R>>>>/Type/Page/UserUnit 2.12>>
endobj
Now we can see that if we multiply the annotation's true placement (4281 pt without scale) by that /UserUnit scalar, we get 4281 x 2.12 = 9075.72, which is the NOMINAL offset in X for the top left-hand corner in the scaled /MediaBox.
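The multiplication above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `annotation_x_in_scaled_units` is made up, a plain dict stands in for the PyPDF2 page dictionary, and only the 4281 and 2.12 values come from the emulation; the other three /Rect numbers are placeholders.

```python
def annotation_x_in_scaled_units(page, rect):
    """Scale the left X of an annotation /Rect by the page's /UserUnit."""
    # /UserUnit is optional in a page dictionary and is treated as 1.0 when absent
    user_unit = float(page.get('/UserUnit', 1.0))
    return float(rect[0]) * user_unit


# Stand-in for the emulated page dictionary (hypothetical, not read from a file)
emulated_page = {'/UserUnit': 2.12}
# Only the left X (4281) comes from the emulation; the other values are placeholders
emulated_rect = [4281, 0, 4381, 50]

scaled_x = annotation_x_in_scaled_units(emulated_page, emulated_rect)
print(round(scaled_x, 2))
```

With a real file, the same lookup should work on a PyPDF2 page object, since it behaves like a dictionary, but I have not tested that against your PDF.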
Annotations (including form fields) are, in effect, not tied in any way to the scale or the datum of the page contents. They can have their own co-ordinate scalar values, and thus scaling, rotating, or moving the page content will often break the relative placements. Luckily, in this case they are using a known historic system.
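For the original goal (finding the note near a text hit), one option is to bring both X values into the same units before comparing them. The following is a rough sketch, not a tested solution: `nearest_note` and its `tolerance` parameter are hypothetical, the second note in the list is invented for illustration, and it compares X only, so a real version would likely need to check Y as well.

```python
def nearest_note(text_x, notes, user_unit, tolerance=5.0):
    """Return the contents of the note whose scaled X is closest to text_x.

    notes is a list of (contents, raw_x) pairs taken from annotation /Rect values.
    Returns None when no note lies within the tolerance.
    """
    best = None
    best_gap = tolerance
    for contents, raw_x in notes:
        # Scale the annotation X by the page scalar before comparing
        gap = abs(raw_x * user_unit - text_x)
        if gap <= best_gap:
            best, best_gap = contents, gap
    return best


# 4280 and 9076 are the values reported in the question; 'Another note' is made up
notes = [('Some text here', 4280), ('Another note', 1200)]
match = nearest_note(9076, notes, user_unit=2.12)
print(match)
```

Without the scalar (a `user_unit` of 1.0) neither note falls within the tolerance, which matches the mismatch described in the question.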