Tags: python, ocr, tesseract, python-tesseract, image-preprocessing

Wrong numbers retrieved in pytesseract


I'm trying to retrieve data from an online image with pytesseract; however, the result is pretty bad and I was wondering if there is a way to improve it.

the picture from that URL

Here is my code:

import io
import requests
import pytesseract
from PIL import Image
response = requests.get("https://port.jpx.co.jp/jpx/chart/chart21.exe?template=ini/DayIndexCSV&basequote=151_2024&begin=2024/4/2&end=2024/04/02&mode=D")
img = Image.open(io.BytesIO(response.content))
text = pytesseract.image_to_string(img)
print(text)
img.show()

# Updated black and white version
import cv2

img.save('1.png')
image = cv2.imread('1.png', 0)  # reload as grayscale
_, thresh1 = cv2.threshold(image, 105, 255, cv2.THRESH_BINARY)
_, thresh2 = cv2.threshold(image, 106, 255, cv2.THRESH_BINARY_INV)
final_thresh = cv2.bitwise_and(thresh1, thresh2)
im = Image.fromarray(final_thresh)
text = pytesseract.image_to_string(im)
print(text)
im.show()
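As a side note on the two thresholds above (checked here with a pure-NumPy re-implementation rather than cv2): `THRESH_BINARY` at 105 keeps values above 105, and `THRESH_BINARY_INV` at 106 keeps values of 106 or below, so their bitwise AND is white only where the pixel value is exactly 106 — a far narrower band than intended.

```python
import numpy as np

# Pure-NumPy re-implementation of the two cv2.threshold calls above,
# applied to a few sample gray values.
image = np.array([100, 105, 106, 107, 200], dtype=np.uint8)
thresh1 = np.where(image > 105, 255, 0)  # cv2.THRESH_BINARY, thresh=105
thresh2 = np.where(image > 106, 0, 255)  # cv2.THRESH_BINARY_INV, thresh=106
band = thresh1 & thresh2                 # cv2.bitwise_and
print(band.tolist())  # [0, 0, 255, 0, 0] -> only the exact value 106 survives
```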

As you can see, the output is very different from the real text.

A lot of digits (2, 4, 5 or 6) are misread as "8", and the separator comes out sometimes as "." and sometimes as ",".

Even if I crop down to just the relevant part, the result is no better:

w, h = img.size
img2 = img.crop((240, 185, w-220, h-220))
text = pytesseract.image_to_string(img2)
print(text)

The real value in the cropped image is "2,714.45", while this code returns "eT AS".

text = pytesseract.image_to_string(img2, config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)

and this code returns only "1".

I don't really understand how this works. I also tried changing the colors as suggested in Use pytesseract OCR to recognize text from an image, but that doesn't work either.

If someone has an idea of how I could make this work, it would be highly appreciated.

Thanks

Updated code:

from PIL import Image
import pytesseract
import requests
import io

response = requests.get("https://port.jpx.co.jp/jpx/chart/chart21.exe?template=ini/DayIndexCSV&basequote=151_2024&begin=2024/4/02&end=2024/04/02&mode=D")
img = Image.open(io.BytesIO(response.content))
grayscale_image = img.convert("L")
original_width, original_height = grayscale_image.size
factor = 300/72 # enlarge from default 72 to 300 dpi
new_width = int(original_width * factor)
new_height = int(original_height * factor)
enlarged_image = grayscale_image.resize((new_width, new_height))
threshold = 175 # found by trial/error (removes disturbing grid lines)
bw_image = enlarged_image.point(lambda x: 0 if x < threshold else 255, '1') # mode='1' -> b/w 

w, h = bw_image.size
bw_image = bw_image.crop((500, 0, w-300, h-480))
text = pytesseract.image_to_string(bw_image)
print(text)
bw_image.show()

Output: 2U2d Ud Ue 2714.45

271445 2024/04/02 Output


Solution

  • A common issue beginners run into with tesseract is expecting that it will easily OCR text taken from screenshots or images designed to be displayed on a computer monitor. But tesseract is primarily designed to OCR text scanned at 300 dpi and supplied as a black/white image. In other words, you need to enlarge an image designed to be viewed at 72 dpi to a size corresponding to 300 dpi, and you need to threshold the image to black/white yourself to remove the grid lines in the diagram that disturb the OCR process.

    With the above implemented in code, tesseract delivers mostly correct results (unless the font has unusual glyphs, or antialiasing at small font sizes means the chosen threshold does not cleanly separate the letters from the background).

    Below is example code implementing the described approach to improve the result of tesseract OCR recognition:

    from PIL import Image
    import pytesseract
    
    img = Image.open("GZYU5.png")
    grayscale_image = img.convert("L")
    grayscale_image.show()
    original_width, original_height = grayscale_image.size
    factor = 300/72 # enlarge from default 72 to 300 dpi
    new_width = int(original_width * factor)
    new_height = int(original_height * factor)
    enlarged_image = grayscale_image.resize((new_width, new_height))
    threshold = 175 # found by trial/error (removes disturbing grid lines)
    bw_image = enlarged_image.point(lambda x: 0 if x < threshold else 255, '1') # mode='1' -> b/w 
    bw_image.show()
    text = pytesseract.image_to_string(bw_image)
    print(text)
    
    

    which prints:

    2,740.00
    
    2,735.00
    2,730.00
    2,725.00
    2024/04/02
    2,720.00 2,714.45
    2,715.00
    2,710.00 2714.45
    2024/04/02
    2,705.00
    1,500
    1,000
    500
    0
    24/4/2
    
    (C)QUICK Corp.
    

    enlarged grayscale and thresholded black/white

    Notice that the order in which the operations are applied to the image can have an impact on the OCR result. The provided code first converts the image to grayscale and enlarges it, and only then converts it to black/white format for OCR recognition.
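A minimal illustration of why thresholding last matters (using a toy 2×2 grayscale patch rather than the chart image): enlarging the grayscale image first lets the interpolation create smooth intermediate grays along letter edges, which the final threshold then cuts cleanly; a 1-bit image has nothing left to interpolate, so enlarging it afterwards only magnifies the jagged pixel blocks.

```python
from PIL import Image

# Toy 2x2 grayscale patch standing in for a letter edge.
img = Image.new("L", (2, 2))
img.putdata([0, 128, 255, 255])

# Enlarge first: bilinear interpolation produces extra intermediate grays,
# so a subsequent threshold can cut a smooth edge.
enlarged = img.resize((8, 8), Image.BILINEAR)
print(len(enlarged.getcolors()) > 3)  # True: more gray levels than the original 3

# Threshold to 1-bit: only the values 0 and 255 remain.
bw = img.point(lambda x: 0 if x < 175 else 255, "1")
print(sorted(v for _, v in bw.getcolors()))  # [0, 255]
```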

    Notice also that:

    though scaling to 300 dpi is a good first guess, in the case of extremely small or large letters, enlarging/shrinking to an appropriate letter size instead is the preferred option - see the tesseract documentation.
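For example, here is a hypothetical helper that scales based on measured letter height rather than a fixed dpi factor; the ~30-35 px target for capital letters follows tesseract's guidance on minimum text size, while the function name and the sample numbers are purely illustrative:

```python
from PIL import Image

def scale_for_ocr(img, letter_height_px, target_height_px=33):
    """Resize so that the measured capital-letter height lands near the
    target height (roughly 30-35 px tends to work well for tesseract)."""
    factor = target_height_px / letter_height_px
    w, h = img.size
    return img.resize((round(w * factor), round(h * factor)))

# Stand-in for the chart image, with digits measured at about 8 px tall:
img = Image.new("L", (400, 200), 255)
scaled = scale_for_ocr(img, letter_height_px=8)
print(scaled.size)  # (1650, 825): a 33/8 ~ 4.1x enlargement
```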

    Below is the correct tesseract OCR result for a region cut out of the image containing the relevant number, along with the screenshot, showing that the lack of context has no impact on the correctness of the OCR result:

    2024/04/02
    2714.45
    

    croppedImageOCR

    croppedImage