I am trying to extract all formats of images from pdf. I did some googling and found this page on StackOverflow. I tried this code but I am getting this error:
I am using python 3.x and here is the code I am using. I tried to go through comments but couldn't figure out. Please help me resolve this.
Here is the sample PDF.
import PyPDF2
from PIL import Image
if __name__ == '__main__':
input1 = PyPDF2.PdfFileReader(open("Aadhaar1.pdf", "rb"))
page0 = input1.getPage(0)
xObject = page0['/Resources']['/XObject'].getObject()
for obj in xObject:
if xObject[obj]['/Subtype'] == '/Image':
size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
data = xObject[obj].getData()
if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
mode = "RGB"
else:
mode = "P"
if xObject[obj]['/Filter'] == '/FlateDecode':
img = Image.frombytes(mode, size, data)
img.save(obj[1:] + ".png")
elif xObject[obj]['/Filter'] == '/DCTDecode':
img = open(obj[1:] + ".jpg", "wb")
img.write(data)
img.close()
elif xObject[obj]['/Filter'] == '/JPXDecode':
img = open(obj[1:] + ".jp2", "wb")
img.write(data)
img.close()
I was reading some comments and going through links and found this problem solved on this page. Can someone please help me implement it?
It is the PyPDF2
library error. Try uninstalling and installing the library with changes or you can see the changes in the GitHub and mark the changes.I hope that will work.