I'm running a script that uses pdfminer to split PDFs into pages and analyze documents on a page-by-page basis. My script goes through the pages like this:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from pdfminer.pdfdocument import PDFDocument
import pytesseract
fp = open(pdf_path, 'rb')
data = []
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
    # Read PDF page, write text into stream
    interpreter.process_page(page)
    text = retstr.getvalue()
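A side note on the loop above: the StringIO buffer is never cleared, so getvalue() returns the text of every page processed so far, not just the current one. That matters for any per-page length check. Below is a minimal sketch of resetting the buffer after each page. The names per_page_texts and write_page are hypothetical, and write_page is only a stand-in for interpreter.process_page(page) writing into the shared buffer, so the sketch runs without pdfminer.

```python
from io import StringIO


def per_page_texts(pages, write_page):
    """Yield the text of each page separately by clearing the buffer after each one.

    write_page(page, buffer) is a hypothetical stand-in for
    interpreter.process_page(page) writing into the shared StringIO.
    """
    buffer = StringIO()
    for page in pages:
        write_page(page, buffer)
        text = buffer.getvalue()
        # Reset the buffer so the next page starts from empty.
        buffer.seek(0)
        buffer.truncate(0)
        yield text


# Stand-in "pages" are plain strings here; with pdfminer they would be PDFPage objects.
texts = list(per_page_texts(["first page", "", "third page"],
                            lambda page, buf: buf.write(page)))
```

In the original loop the same effect should come from calling retstr.seek(0) and retstr.truncate(0) right after reading text.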
However, sometimes I get PDFs that are image-based, and my text variable ends up empty. I couldn't find a "convert_image_to_string" function in pdfminer, so I tried an option with pdf2image:
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):  # previous code
    # Read PDF page, write text into stream  # previous code
    interpreter.process_page(page)  # previous code
    text = retstr.getvalue()  # previous code
    if len(text) < 100:  # new code
        from pdf2image import convert_from_path  # new code
        img = convert_from_path(page, 350)  # new code
        text = pytesseract.image_to_string(page)  # new code
But pdf2image.convert_from_path expects a file path, and my code passes it a pdfminer page object, so I get TypeError: expected str, bytes or os.PathLike object, not PDFPage. So I would very much appreciate a suggestion on how to:
a) use pdfminer to convert image-based PDFs to text; or
b) use pdfminer to save a single PDF page to disk, so that I can pass its file path to pdf2image.convert_from_path.
Well, no one answered, and the workaround I found was to give up on pdfminer and rely on PyPDF2 (to split out pages), pdf2image, and pytesseract instead. Hope it helps someone with the same problem.
from PyPDF2 import PdfFileReader, PdfFileWriter  # older PyPDF2 API (before 3.0)
from pdf2image import convert_from_path
import pytesseract

pdf = PdfFileReader(pdf_path)
numpages = pdf.getNumPages()
for pageNumber in range(numpages):
    page = pdf.getPage(pageNumber)
    text = page.extractText()
    if len(text) < 100:
        # Too little text: treat the page as a scan, save it on its own, and OCR it
        pdfWriter = PdfFileWriter()
        pdfWriter.addPage(page)
        imgpath = "pdfpage.pdf"  # use the same path for writing and reading
        with open(imgpath, 'wb') as f:
            pdfWriter.write(f)
        img = convert_from_path(imgpath, 350)[0]
        try:
            text = pytesseract.image_to_string(img)
        except Exception:
            text = "no text"
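A possible variation that avoids writing a temporary one-page PDF: pdf2image's convert_from_path accepts first_page and last_page arguments, so a single page can be rendered straight from the original file by its page number. The sketch below is untested against real documents, and the helper names needs_ocr and ocr_page are hypothetical. It also strips whitespace before measuring the text length, which is an assumption about what counts as "empty" rather than part of the workaround above.

```python
def needs_ocr(text, min_chars=100):
    """Return True when a page yielded too little text to be a text-based page."""
    return len(text.strip()) < min_chars


def ocr_page(pdf_path, page_index, dpi=350):
    """OCR a single page (0-based index) of the PDF at pdf_path."""
    # Imported lazily so needs_ocr stays usable without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image numbers pages from 1, hence the +1.
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_index + 1, last_page=page_index + 1
    )
    return pytesseract.image_to_string(images[0]) if images else ""
```

Inside either page loop this would be used as: if needs_ocr(text): text = ocr_page(pdf_path, pageNumber). Since it only needs the file path and the page number from enumerate, it should also work alongside the pdfminer loop from the question.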