pythonpython-imaging-librarypypdfpdf-extraction

Error while image extraction from PDF in python


I am trying to extract all formats of images from pdf. I did some googling and found this page on StackOverflow. I tried this code but I am getting this error:

I am using python 3.x and here is the code I am using. I tried to go through comments but couldn't figure out. Please help me resolve this.

Here is the sample PDF.

import PyPDF2

from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("Aadhaar1.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()

I was reading some comments and going through links and found this problem solved on this page. Can someone please help me implement it?


Solution

  • It is the PyPDF2 library error. Try uninstalling and installing the library with changes or you can see the changes in the GitHub and mark the changes.I hope that will work.