My code opens a PDF, converts the first page to an image, then crops rectangles out of this image by coordinates and extracts text from each cropped rectangle using Tesseract.
I discovered that in some cases, mostly with larger images, OCR performs much worse than in others.
After playing around with Tesseract on the command line, I also discovered that for some images Tesseract estimates the resolution itself, which affects the result.
I also experimented with the --dpi parameter. For some images the best results were obtained with --dpi 1800, for others with --dpi 300. I'm looking for a way to set the DPI of my images before extracting text, or a way to find out the DPI of my images.
I also tried pix.set_dpi() and get_pixmap(dpi=...), but that didn't improve anything. I would be thankful for any suggestions.
Here is the code I use:
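import cv2 as cv
import fitz  # PyMuPDF
import numpy as np
import pytesseract

# doc is an already opened fitz.Document
page = doc.load_page(0)
page_size = page.rect
zoom = 3  # scale factor relative to the 72 dpi base resolution
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
# view the raw pixel samples as a (height, width, channels) array
img_data = pix.samples
img_array = np.frombuffer(img_data, dtype=np.uint8)
img_array = img_array.reshape(pix.height, pix.width, pix.n)
img = cv.cvtColor(img_array, cv.COLOR_RGB2BGR)
#...
k = 0
result_dict = {}
for i, rect in enumerate(rectangles):
    x1, y1, x2, y2 = rect
    roi = img[y1:y2, x1:x2]  # crop: rows are y, columns are x
    k += 1
    text = pytesseract.image_to_string(roi, lang="eng+deu")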
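On the resolution question: a NumPy array carries no DPI metadata, which is probably why Tesseract falls back to estimating it. The render resolution follows from the zoom factor, since PDF pages are measured in points of 1/72 inch. Below is a minimal sketch of that arithmetic. The helper names (zoom_to_dpi, dpi_to_zoom, tesseract_dpi_config) are made up for illustration and are not part of PyMuPDF or pytesseract.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom: float) -> int:
    """Resolution of a page rendered with fitz.Matrix(zoom, zoom)."""
    return round(PDF_POINTS_PER_INCH * zoom)


def dpi_to_zoom(dpi: float) -> float:
    """Zoom factor needed to render a page at the given resolution."""
    return dpi / PDF_POINTS_PER_INCH


def tesseract_dpi_config(zoom: float) -> str:
    """Config string that tells Tesseract the resolution of the rendered image."""
    return f"--dpi {zoom_to_dpi(zoom)}"


print(zoom_to_dpi(3))                # 216
print(tesseract_dpi_config(3))       # --dpi 216
print(round(dpi_to_zoom(300), 2))    # 4.17
```

Under these assumptions, the string could be passed along as pytesseract.image_to_string(roi, lang="eng+deu", config=tesseract_dpi_config(zoom)), so that Tesseract is given the actual render resolution instead of guessing it. Whether this improves recognition for a particular document still has to be tested.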
To OCR only a region of a PDF page, render just that region and let PyMuPDF run the OCR on it, like this:
import fitz
doc = fitz.open("input.pdf")
page = doc[pno] # 0-based page number
rect = fitz.Rect(x0, y0, x1, y1) # an area on the page
pix = page.get_pixmap(clip=rect, dpi=150)
# make a 1-page temp PDF from the area and OCR it
ocr = fitz.open("pdf", pix.pdfocr_tobytes()) # 1-page temp PDF
ocrpage = ocr[0]
text = ocrpage.get_text() # OCRed text
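If the rectangles are already known as pixel coordinates on an image rendered with fitz.Matrix(zoom, zoom), they need to be converted back to page coordinates before they can be used as clip. This is a rough sketch, assuming an unrotated page whose origin is at (0, 0) and the same zoom for both axes. The function name pixel_rect_to_points is hypothetical.

```python
def pixel_rect_to_points(rect, zoom):
    """Map an (x1, y1, x2, y2) pixel box on the zoomed image to PDF points."""
    x1, y1, x2, y2 = rect
    return (x1 / zoom, y1 / zoom, x2 / zoom, y2 / zoom)


# A box cropped from an image rendered with zoom = 3
print(pixel_rect_to_points((300, 150, 900, 450), 3))  # (100.0, 50.0, 300.0, 150.0)
```

The resulting tuple can then be wrapped as fitz.Rect(*pixel_rect_to_points(rect, zoom)) and passed as clip to page.get_pixmap().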