I am using pytesseract to run OCR on images. I have statement PDFs that are 3-4 pages long. I need a way to convert them into multiple .jpg/.png images and run OCR on these images one by one. As of now, I am converting a single page to an image and then I run
text=str(pytesseract.image_to_string(Image.open("imagename.jpg"),lang='eng'))
after which I use regex to extract information and create a dataframe. The regex logic is the same for all the pages. If I can read the image files in a loop, the process can be automated for any PDF that arrives in the same format.
PyMuPDF is another option: it can render each page of the PDF to an image inside a loop. Here is how you can do this:
import fitz  # PyMuPDF
from PIL import Image
import pytesseract

input_file = 'path/to/your/pdf/file'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)  # open the PDF with PyMuPDF
### ---- If you need to scale a scanned image --- ###
zoom = 1.2  # scale each page by 120%
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.page_count  # doc.pageCount in older PyMuPDF releases
for pageNo in range(noOfPages):
    page = doc.load_page(pageNo)  # load one page by index (loadPage in older releases)
    pix = page.get_pixmap(matrix=mat)  # render the page, scaled by mat (getPixmap in older releases)
    # the extension matches the PNG data that is actually written
    output = '/path/to/save/image/files' + str(pageNo) + '.png'
    pix.save(output)  # write the rendered page to disk so PIL can open it (writePNG in older releases)
    text = str(pytesseract.image_to_string(Image.open(output)))
    fullText += text
fullText = fullText.splitlines()  # or do something here to extract information using regex
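A note on the zoom value: PDF page sizes are measured in points (72 per inch), so a zoom of 1.0 renders at roughly 72 dpi and zoom = 1.2 at roughly 86 dpi. As far as I know, Tesseract tends to do better on images around 300 dpi, so a larger zoom may be worth trying on small print. The helper names below (zoom_for_dpi, rendered_size) are hypothetical and not part of PyMuPDF; this is only a sketch of the arithmetic, and the pixel size PyMuPDF actually produces may differ by a pixel due to rounding.

```python
PDF_BASE_DPI = 72  # PDF user-space units are 1/72 inch


def zoom_for_dpi(target_dpi):
    """Return the zoom factor that renders a page at roughly target_dpi."""
    if target_dpi <= 0:
        raise ValueError("target_dpi must be positive")
    return target_dpi / PDF_BASE_DPI


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixel size of a page (given in points) after scaling by zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    zoom = zoom_for_dpi(300)
    # A US Letter page is 612 x 792 points.
    print(zoom, rendered_size(612, 792, zoom))
```

The result can be passed straight into fitz.Matrix(zoom, zoom) in the loop above. Larger zoom values mean larger images and slower OCR, so it is a trade-off to test on your own statements.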
This is handy depending on what you want to do with the PDF files. For more detailed information about PyMuPDF, these links might be helpful: tutorial on PyMuPDF and git for PyMuPDF
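Since the regex logic is the same for every page, the extraction step can live in one function that takes the OCR text of each page. The sketch below is only an illustration: the pattern (a date, a description and an amount on one line) and the names LINE_PATTERN and extract_rows are assumptions of mine, so swap in the regex you already use for your statements.

```python
import re

# Hypothetical line format: "DD/MM/YYYY  description  amount".
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def extract_rows(page_texts):
    """Apply the same regex to every page and collect one dict per matching line."""
    rows = []
    for page_no, text in enumerate(page_texts):
        for line in text.splitlines():
            match = LINE_PATTERN.match(line.strip())
            if match is None:
                continue  # headers, totals and OCR noise are skipped
            row = match.groupdict()
            row["amount"] = float(row["amount"].replace(",", ""))
            row["page"] = page_no  # keep track of which page the row came from
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Stand-in strings; in practice these come from pytesseract, one per page.
    pages = [
        "Statement\n01/02/2020  Coffee shop  -4.50\n",
        "03/02/2020 Salary 1,200.00\nTotal",
    ]
    for row in extract_rows(pages):
        print(row)
```

To use it with the loop above, append each page's text to a list instead of concatenating into one string, then call extract_rows on that list. The resulting list of dicts can be handed to pandas.DataFrame(rows) to build the dataframe.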
Hope this helps.
EDIT
Another, more straightforward way of doing this with PyMuPDF is to read the embedded text directly and skip OCR, provided your PDF files have a clean text layer. Once you have the page object from the loop above, the following is sufficient:
blocks = page.get_text("blocks")  # page.getText("blocks") in older PyMuPDF releases
blocks.sort(key=lambda block: block[3])  # sort by 'y1' values
for block in blocks:
    print(block[4])  # print the lines of this block
Disclaimer: The above idea of using blocks came from the repo maintainer. More detailed information can be found here: issues discussion on git
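To see what the sort is doing, it helps to know that each entry returned for "blocks" is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type), where block_type is 0 for text and 1 for images. The sketch below runs on hand-made tuples so it does not need a PDF; the function name ordered_block_text is hypothetical, and the x0 tie-break and the image filter are my additions to the y1 sort shown above.

```python
def ordered_block_text(blocks):
    """Return the text of text blocks, sorted top to bottom, then left to right."""
    text_blocks = [b for b in blocks if b[6] == 0]  # drop image blocks (type 1)
    # Sort by y1 (bottom edge) as in the snippet above, with x0 as a tie-breaker.
    text_blocks.sort(key=lambda b: (b[3], b[0]))
    return [b[4] for b in text_blocks]


if __name__ == "__main__":
    # Made-up blocks in the shape (x0, y0, x1, y1, text, block_no, block_type).
    sample = [
        (50, 200, 300, 220, "second\n", 1, 0),
        (50, 100, 300, 120, "first\n", 0, 0),
        (320, 100, 500, 120, "first-right\n", 2, 0),
        (50, 300, 300, 400, "<image>", 3, 1),
    ]
    for text in ordered_block_text(sample):
        print(text, end="")
```

With a real document you would pass in the list returned for "blocks" on each page. For multi-column statements a plain vertical sort can interleave the columns, so check the output on one of your own files first.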