So far I have used several Python packages to extract information from PDF files, but I couldn't find a way to measure the margins of a PDF document. Specifically, I want to get the size of the margins on all four sides of a PDF page.
I have used the pdfcrop package to remove the white space (margins), but I couldn't measure the margin size with it.
Example using PyMuPDF
import fitz  # package PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
    blocks = page.get_text("blocks")  # extract text and image blocks
    for b in blocks:
        rect |= b[:4]  # join rect with the block bbox
    if rect.is_empty:  # no text or image blocks found on this page
        print(f"Page {page.number}: no content found")
        continue
    # all values are in points (1/72 inch)
    print(f"Page {page.number} margins:")
    print(f"  Top    {rect.y0}")
    print(f"  Left   {rect.x0}")
    print(f"  Right  {page.rect.width - rect.x1}")
    print(f"  Bottom {page.rect.height - rect.y1}")
If you want to extend the above to all possible page content, and not just text and images, use the even faster method page.get_bboxlog(). It does not read or extract anything; it only returns the rectangles covered by content.
The above loop would then be:
...
for page in doc:
    rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
    for item in page.get_bboxlog():
        b = item[1]  # the bbox of the covered area
        rect |= b[:4]  # join rect with the covered area's bbox
    if rect.is_empty:  # nothing is drawn on this page
        print(f"Page {page.number}: no content found")
        continue
    # all values are in points (1/72 inch)
    print(f"Page {page.number} margins:")
    print(f"  Top    {rect.y0}")
    print(f"  Left   {rect.x0}")
    print(f"  Right  {page.rect.width - rect.x1}")
    print(f"  Bottom {page.rect.height - rect.y1}")
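The margin arithmetic itself does not depend on PyMuPDF, so it can be checked in isolation. Below is a minimal sketch in plain Python: it joins a list of bounding boxes and subtracts the result from the page size. The function name margins_from_bboxes, the unit parameter and the None return for an empty page are assumptions made for this illustration; they are not part of PyMuPDF.

```python
from typing import Dict, Iterable, Optional, Sequence

POINTS_PER_MM = 72 / 25.4  # 1 inch = 72 points = 25.4 mm


def margins_from_bboxes(
    page_width: float,
    page_height: float,
    bboxes: Iterable[Sequence[float]],
    unit: str = "pt",
) -> Optional[Dict[str, float]]:
    """Return the four margins around the union of all bboxes.

    Each bbox is (x0, y0, x1, y1) in points, with the origin at the
    top-left corner of the page. Hypothetical helper, not a PyMuPDF API.
    Returns None when there is no content to measure.
    """
    boxes = [tuple(b[:4]) for b in bboxes]
    if not boxes:
        return None  # empty page: margins are undefined

    # Union of all rectangles: smallest top-left, largest bottom-right.
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)

    if unit == "pt":
        scale = 1.0
    elif unit == "mm":
        scale = 1 / POINTS_PER_MM
    else:
        raise ValueError("unit must be 'pt' or 'mm'")

    return {
        "top": y0 * scale,
        "left": x0 * scale,
        "right": (page_width - x1) * scale,
        "bottom": (page_height - y1) * scale,
    }
```

With PyMuPDF, this could be called as margins_from_bboxes(page.rect.width, page.rect.height, [b[:4] for b in page.get_text("blocks")]), or with [item[1] for item in page.get_bboxlog()] to cover all page content.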