python ocr python-tesseract

Why do I get nothing in output with pytesseract?


I have installed language support for chi_sim:

$ ls /usr/share/tesseract-ocr/5/tessdata
chi_sim.traineddata  eng.traineddata  pdf.ttf
configs              osd.traineddata  tessconfigs

You can try it by downloading photo.jpeg and using the following code:

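import cv2
from PIL import Image
import pytesseract

image_path = 'photo.jpeg'
image = cv2.imread(image_path)  # returns None if the file cannot be read
if image is None:
    raise FileNotFoundError(image_path)
# OpenCV loads pixels in BGR order; convert to RGB before handing them to PIL
image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)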

Why do I get nothing in output with the above code?

>>> print(pytesseract.get_languages(config=''))
['chi_sim', 'eng', 'osd']

Solution

  • As it stands, that image is simply too poor for Tesseract to pick out clear characters. It would need to be rectified, have its contrast improved, and be colour-thresholded to remove the background noise.

    The image below shows how some of those problems might be corrected. However, what is left is still below par for ordinary OCR.

    [image]

    So why can some systems see that image and generate good text, like this:

    中华人民共和国
    居民身份证
    签发机关
    有效期限
    2007.05.14-2027.05 14
    

    The answer is aggregation: those systems draw on many similar images, so across them they can see an average that is above par.

    [image]

    Even if you clean an image up as well as this, Tesseract will still not come as close as an artificially improved interpretation.

    [image]