
Is there a way to measure the margins of a PDF file using image processing in Python?


So far I have used several Python packages to extract information from PDF files, but I couldn't find a way to measure the margins of a PDF document. For example, I want to get the size of the margins on all four sides of a PDF page.

I have used the pdfcrop package to remove the white space (margins), but I couldn't measure the margin size.


Solution

  • Example using PyMuPDF

    import fitz  # package PyMuPDF
    doc = fitz.open("input.pdf")
    
    for page in doc:
        rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
        blocks = page.get_text("blocks")  # extract text and image blocks
        for b in blocks:
            rect |= b[:4]  # join rect with the block bbox
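        if rect.is_empty:  # no blocks on this page, so there is nothing to measure
            print(f"Page {page.number}: no text or image blocks found")
            continue
        print(f"Page {page.number} margins:")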
        print(f"    Top {rect.y0}")
        print(f"   Left {rect.x0}")
        print(f"  Right {page.rect.width - rect.x1}")
        print(f" Bottom {page.rect.height - rect.y1}")
    

    To extend the above to all page content, and not just text and images, use the even faster method page.get_bboxlog(). It does not read or extract anything; it only returns the rectangles covered by content. The loop above would then be:

    ...
    for page in doc:
        rect = fitz.EMPTY_RECT()  # prepare an empty rectangle
        for item in page.get_bboxlog():
            b = item[1]  # the bbox of the covered area
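            rect |= b[:4]  # join rect with the covered bbox

        if rect.is_empty:  # nothing is drawn on this page, so there is nothing to measure
            print(f"Page {page.number}: no content found")
            continue
        print(f"Page {page.number} margins:")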
        print(f"    Top {rect.y0}")
        print(f"   Left {rect.x0}")
        print(f"  Right {page.rect.width - rect.x1}")
        print(f" Bottom {page.rect.height - rect.y1}")
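The margins printed by the loops above are in PDF points (1 pt = 1/72 inch), measured from a page whose top-left corner is at (0, 0). As a rough sketch of the same arithmetic without PyMuPDF, the helper below takes a page size and a list of bounding boxes and returns the four margins, plus a conversion to millimetres. The names `margins_from_bboxes` and `pt_to_mm` are hypothetical and are not part of PyMuPDF. The boxes are assumed to be `(x0, y0, x1, y1)` tuples like those returned by `page.get_text("blocks")` or `page.get_bboxlog()`.

```python
POINTS_PER_INCH = 72.0  # PDF user-space unit: 1 pt = 1/72 inch
MM_PER_INCH = 25.4


def margins_from_bboxes(page_width, page_height, bboxes):
    """Return the margins in points, or None if there are no boxes.

    Hypothetical helper: assumes the page origin (0, 0) is the top-left
    corner and each box starts with (x0, y0, x1, y1).
    """
    boxes = [tuple(b[:4]) for b in bboxes]
    if not boxes:
        return None  # empty page: no content, so no meaningful margins
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return {
        "top": y0,
        "left": x0,
        "right": page_width - x1,
        "bottom": page_height - y1,
    }


def pt_to_mm(points):
    """Convert PDF points to millimetres."""
    return points * MM_PER_INCH / POINTS_PER_INCH


# Example with made-up boxes on a 595 x 842 pt page (roughly A4)
example = margins_from_bboxes(595, 842, [(72, 72, 523, 400), (72, 420, 500, 770)])
for side, value in example.items():
    print(f"{side:>6}: {value} pt = {pt_to_mm(value):.1f} mm")
```

In the PyMuPDF loops, the same conversion could be applied to `rect.y0`, `rect.x0`, and the two differences before printing.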