[SOLVED] Is there a way to measure the margins of a PDF file using image processing in Python?

So far I have used several Python packages to extract information from PDF files, but I couldn't find a way to measure the margins of a PDF document. For example, I want to get the size of the margins on all four sides of a PDF page.

I have used the pdfcrop package to remove the white space (margins), but I couldn't measure the margin size with it.

Solution

Example using PyMuPDF

import fitz  # package PyMuPDF
doc = fitz.open("input.pdf")

for page in doc:
    rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
    blocks = page.get_text("blocks")  # extract text and image blocks
    for b in blocks:
        rect |= b[:4]  # join rect with the block bbox
    if rect.is_empty:  # nothing was joined: the page has no such content
        print(f"Page {page.number} has no text or image blocks")
        continue
    # all values below are in points (1/72 inch)
    print(f"Page {page.number} margins:")
    print(f"    Top {rect.y0}")
    print(f"   Left {rect.x0}")
    print(f"  Right {page.rect.width - rect.x1}")
    print(f" Bottom {page.rect.height - rect.y1}")

To extend the above to all possible page content, and not just text or images, use the even faster method page.get_bboxlog(). It does not read or extract anything; it only returns the rectangles covered by page content. The above loop would then be:

...
for page in doc:
    rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
    for item in page.get_bboxlog():
        b = item[1]  # the bbox of the covered area
        rect |= b[:4]  # join rect with the covered bbox
    if rect.is_empty:  # nothing was joined: the page is blank
        print(f"Page {page.number} has no content")
        continue

    # all values below are in points (1/72 inch)
    print(f"Page {page.number} margins:")
    print(f"    Top {rect.y0}")
    print(f"   Left {rect.x0}")
    print(f"  Right {page.rect.width - rect.x1}")
    print(f" Bottom {page.rect.height - rect.y1}")
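The margins printed above are in PDF points. Assuming the default PDF user unit of 1/72 inch, they can be converted to inches or millimetres with a small helper. This is only a sketch: the function names and the sample margin values are made up for illustration and are not part of PyMuPDF.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit: 1 pt = 1/72 inch
MM_PER_INCH = 25.4


def points_to_inches(points: float) -> float:
    """Convert a length in PDF points to inches."""
    return points / POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


# Hypothetical margin values in points, standing in for the numbers
# a loop like the ones above would print for one page.
margins_pt = {"top": 72.0, "left": 54.0, "right": 54.0, "bottom": 36.0}

for side, value in margins_pt.items():
    inches = points_to_inches(value)
    mm = points_to_mm(value)
    print(f"{side:>6}: {value:.1f} pt = {inches:.2f} in = {mm:.1f} mm")
```

In practice you would pass rect.y0, rect.x0 and the two differences from the loop into these helpers instead of the sample dictionary.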