I have a PDF document that I want to scan with pytesseract, but the page numbers are not recognized on any of the pages. The PDF was generated with LaTeX. I tried different page segmentation modes (--psm), but it did not help. How can I get Tesseract to recognize the page numbers?
The PDF document is uploaded via Streamlit and passed to the function as a BytesIO-like object. The function returns a list of words (strings).
Code:
from io import BytesIO

import pymupdf
import pytesseract
from PIL import Image


def get_text_from_ocr(uploaded_file):
    images = []
    config = r"--psm 3"  # 3: Fully automatic page segmentation, but no OSD. (Default)
    # pdf to images
    uploaded_file.seek(0)
    pdf_bytes = uploaded_file.read()
    doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        img = Image.open(BytesIO(pix.tobytes("png")))
        images.append(img)
    # Do OCR
    text = [word for img in images for word in pytesseract.image_to_string(img, config=config).split()]
    return text
I also tried some preprocessing, but it did not help either (convert to binary and enlarge the image).
Code:
def get_text_from_ocr(uploaded_file):
    images = []
    config = r"--psm 3"  # 3: Fully automatic page segmentation, but no OSD. (Default)
    # pdf to images
    uploaded_file.seek(0)
    pdf_bytes = uploaded_file.read()
    doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        img = Image.open(BytesIO(pix.tobytes("png")))
        # Preprocessing
        gray = img.convert("L")  # "L" = 8-bit grayscale
        # Enlarge the grayscale image first, so the threshold is applied to the final pixels
        scale = 2
        resized = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
        # Tune threshold value as needed (e.g., 180, 200)
        binary = resized.point(lambda x: 0 if x < 180 else 255, "1")  # "1" mode = black & white
        # Append the preprocessed image (previously the binary image was computed but never used)
        images.append(binary)
    # Do OCR
    text = [word for img in images for word in pytesseract.image_to_string(img, config=config).split()]
    return text
PDF: PDF document
I am using Windows 10 and Python 3.13.3.
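Change this:
config = r"--psm 3" # 3
To:
config = r"--psm 6 --oem 3 -l eng"
Here --psm 6 tells Tesseract to assume a single uniform block of text, --oem 3 selects the default OCR engine mode, and -l eng sets the recognition language to English.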
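If changing the page segmentation mode alone does not pick up the page numbers, one option that may be worth trying is to OCR the footer strip of each page separately, treating it as a single line of digits. The sketch below is only an illustration under assumptions: the helper names (footer_box, page_number_config, ocr_page_number) are hypothetical, the 12% footer height is a guess that would need tuning for the actual layout, and the digit whitelist is reported to work with the LSTM engine only in newer Tesseract releases.

```python
def footer_box(width, height, fraction=0.12):
    """Return the (left, upper, right, lower) crop box for the bottom strip of a page.

    `fraction` is the share of the page height assumed to contain the page number
    (hypothetical default; adjust it to the document's layout).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    top = int(height * (1 - fraction))
    return (0, top, width, height)


def page_number_config(psm=7):
    """Build a Tesseract config string for reading a short run of digits.

    --psm 7 treats the image as a single text line; the whitelist restricts
    the output to digits (assumes the installed Tesseract honors it).
    """
    return f"--psm {psm} --oem 3 -c tessedit_char_whitelist=0123456789"


def ocr_page_number(img, fraction=0.12):
    """OCR only the footer strip of a PIL image; returns the recognized digits (may be empty)."""
    # Imported here so the helpers above can be used without Tesseract installed.
    import pytesseract

    strip = img.crop(footer_box(img.width, img.height, fraction))
    return pytesseract.image_to_string(strip, config=page_number_config()).strip()
```

Inside the page loop of get_text_from_ocr, calling ocr_page_number(img) on each rendered page would give the page number separately from the body text, which can then be appended to the word list if it is not empty.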