python pdf pymupdf image-extraction

Extracting Images from a PDF using PyMuPDF gives broken output images


The code I am using to extract the images is

import io
import os

from PIL import Image


def extract_images_from_pdfs(pdf_list):
    import fitz  # PyMuPDF
    
    output_dir = "C:/path_to_image"
    os.makedirs(output_dir, exist_ok=True)
    
    for pdf_path in pdf_list:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
       
        # Open the PDF
        pdf_document = fitz.open(pdf_path)
        
        # Track the count of images extracted from this PDF
        image_count = 0
        
        for page_num, page in enumerate(pdf_document):
            # Get the images on this page
            image_list = page.get_images(full=True)
            
            if not image_list:
                print(f"No images found on page {page_num+1} of {pdf_name}")
                continue
            
            # Process each image
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                
                if base_image:
                    image_bytes = base_image["image"]
                    image_ext = base_image["ext"]
                    
                    # Convert bytes to image
                    image = Image.open(io.BytesIO(image_bytes))
                    
                    # Save the image
                    image_name = f"{pdf_name}_image_{image_count}.{image_ext}"
                    image_path = os.path.join(output_dir, image_name)
                    
                    image.save(image_path)
                    
                    image_count += 1
        
        pdf_document.close()
        print(f"Extracted {image_count} images from {pdf_name}")

The input, pdf_list, is just a list containing the names of all my PDFs.

Extracted image 1: [image]

Extracted image 2: [image]

Expected image: [image]

Could it be that the images in the PDF are encrypted or otherwise inaccessible, and is there a workaround for this?

Any help is greatly appreciated.

The PDF is available at testingpdfexampaper.tiiny.site


Solution

  • The PDF contains 78 very small pieces of imagery, of which the "largest" is the mask for the O on the first page:

     1    60 image      81    62  index   1   8  image  no       271  0   151   151 1996B  40%
    

    Many are simply one single pixel.
    They can be in any order, and the early ones of the 78 are generally parts of the R, which is what the OP's first extracted image shows.

    pdfimages -list chem.pdf
    page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
    --------------------------------------------------------------------------------------------
       1     0 image       4    26  cmyk    4   8  image  no       214  0   163   153   77B  19%
       1     1 image       2     2  cmyk    4   8  image  no       215  0   204   245   21B 131%
       1     2 image       7    59  index   1   8  image  no       226  0   306   303   53B  13%
       1     3 image      60    39  index   1   8  image  no       237  0   150   153  819B  35%
       1     4 image       1     1  cmyk    4   8  image  no       248  0   204   204   14B 350%
       1     5 image       9     4  cmyk    4   8  image  no       259  0   162   153   74B  51%
       1     6 image      58    31  index   1   8  image  no       270  0   150   154  526B  29%
       1     7 image       4     3  cmyk    4   8  image  no       281  0   153   153   38B  79%
       1     8 image       2     2  cmyk    4   8  image  no       290  0   153   175   24B 150%
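A rough PyMuPDF counterpart to the `pdfimages -list` check above, as a minimal sketch: it collects the size of every image reference so you can see how many are mere fragments. The helper names `list_pdf_images` and `count_small`, and the 64-pixel threshold, are illustrative assumptions and not part of PyMuPDF.

```python
def list_pdf_images(pdf_path):
    """Return one dict per image reference found on each page of the PDF."""
    import fitz  # PyMuPDF, imported lazily as in the question's code

    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                # Each tuple starts with: xref, smask, width, height, bpc, colorspace
                xref, smask, width, height, bpc, colorspace = img[:6]
                rows.append({
                    "page": page.number + 1,
                    "xref": xref,
                    "smask": smask,
                    "width": width,
                    "height": height,
                    "colorspace": colorspace,
                })
    return rows


def count_small(rows, max_side=64):
    """Count images whose width and height are both at most max_side pixels."""
    return sum(
        1 for r in rows
        if r["width"] <= max_side and r["height"] <= max_side
    )
```

Calling `rows = list_pdf_images("chem.pdf")` and then comparing `count_small(rows)` with `len(rows)` should show whether nearly everything in the file is a tiny fragment, in which case extracting images one by one cannot reproduce the figure you see on the page.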
    

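    NOTE: as is common with many PDF constructions, there is no "one to one" relationship between what you see on the page and the objects stored in the file.
    One line of text can be stored in many separate places, and one visible line can be made up of multiple paths too.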


    Thus image extraction is of no real value here. Any whole page can instead be exported as a single image and then trimmed to the desired areas, at any density/quality you wish.
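The "export the page, then trim" idea can be sketched in PyMuPDF by rendering only a clipped region of each page. This is a hedged example, not a definitive recipe: the function names `merge_rects` and `render_regions`, the 5-point `gap`, and the choice to derive regions from the image fragments' bounding boxes are all assumptions. If the figures also contain vector lines or text outside those boxes, you would pass your own rectangles instead.

```python
import os


def merge_rects(rects, gap=5.0):
    """Merge (x0, y0, x1, y1) boxes that overlap or lie within `gap` points."""
    merged = []
    for r in rects:
        r = tuple(r)
        changed = True
        while changed:
            changed = False
            for m in merged:
                close_x = r[0] <= m[2] + gap and m[0] <= r[2] + gap
                close_y = r[1] <= m[3] + gap and m[1] <= r[3] + gap
                if close_x and close_y:
                    # Absorb m into r, then re-check against the remaining boxes
                    merged.remove(m)
                    r = (min(r[0], m[0]), min(r[1], m[1]),
                         max(r[2], m[2]), max(r[3], m[3]))
                    changed = True
                    break
        merged.append(r)
    return merged


def render_regions(pdf_path, out_dir, dpi=200, gap=5.0):
    """Render each merged group of image fragments as one PNG per region."""
    import fitz  # PyMuPDF

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Bounding boxes of every image fragment shown on this page
            boxes = [info["bbox"] for info in page.get_image_info()]
            for i, box in enumerate(merge_rects(boxes, gap)):
                clip = fitz.Rect(box) & page.rect  # stay inside the page
                if clip.is_empty:
                    continue
                pix = page.get_pixmap(dpi=dpi, clip=clip)
                path = os.path.join(out_dir, f"page{page.number + 1}_region{i}.png")
                pix.save(path)
                saved.append(path)
    return saved
```

Because the page is rendered as a whole, masks, fragments, paths and text inside the clip area all end up composited in the output, which is what the per-image extraction cannot give you. Raising `dpi` trades file size for sharpness.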


    In Python, PyMuPDF can "gather" vector "paths" and combine them into single graphical units. So if you select an area that includes them (a Region of Interest), they can possibly be reused as vectors elsewhere.
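One way this gathering might look, as a sketch under assumptions: `Page.cluster_drawings()` (available in recent PyMuPDF releases, 1.24.2 or later as far as I know) returns rectangles that each enclose a group of nearby vector paths, and `Page.show_pdf_page()` can copy a clipped area into a new PDF without rasterizing it. The names `keep_large` and `export_vector_clusters` and the 20-point size filter are hypothetical.

```python
def keep_large(rects, min_side=20.0):
    """Drop clusters narrower or shorter than min_side points."""
    return [
        r for r in rects
        if (r[2] - r[0]) >= min_side and (r[3] - r[1]) >= min_side
    ]


def export_vector_clusters(pdf_path, out_path, min_side=20.0):
    """Copy each sizeable cluster of drawings onto its own page of a new PDF."""
    import fitz  # PyMuPDF

    src = fitz.open(pdf_path)
    out = fitz.open()  # new, empty PDF
    for page in src:
        # Each returned Rect encloses one group of neighbouring vector paths
        clusters = [tuple(r) for r in page.cluster_drawings()]
        for box in keep_large(clusters, min_side):
            clip = fitz.Rect(box)
            new_page = out.new_page(width=clip.width, height=clip.height)
            # Copies everything inside `clip` (paths, text, image fragments)
            new_page.show_pdf_page(new_page.rect, src, page.number, clip=clip)
    count = out.page_count
    if count:  # saving a PDF with zero pages is an error
        out.save(out_path)
    out.close()
    src.close()
    return count
```

The output stays vector, so each region can be scaled or reused elsewhere without loss. Note that the clip copies all page content inside the rectangle, not only the paths that formed the cluster.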

    This is similar in effect to the way the MuPDF command line can, with a few well-chosen commands, export SVG areas for reuse.
