I'm trying to extract images from a PDF file using pdfminer.six
There doesn't seem to be any documentation about how to do this with Python.
This is what I have so far:
import os
import pdfminer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
os.chdir('C:\\Users\\zone_\\Desktop')
diretorio = os.getcwd()
file = str(diretorio) + '\\example.pdf'
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = open(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
return text
This is extracting the text, but how to retrieve the images in the pdf?
I have never used pdfminer, however I found this code and this document from Denis Papathanasiou explaining it, which might be of some help to figure this out, as pdfminer's documentation is not very exhaustive. The document is from an outdated version, but the code was recently updated.
If you are not required to use pdfminer, there are alternatives which might be easier such as PyMuPDF found in this answer which extracts all images in the PDF as PNG.