I have installed language support for chi_sim:
ls /usr/share/tesseract-ocr/5/tessdata
chi_sim.traineddata eng.traineddata pdf.ttf
configs osd.traineddata tessconfigs
You can try it by downloading photo.jpeg and using the following code:
import cv2
from PIL import Image
import pytesseract

image_path = 'photo.jpeg'
image = cv2.imread(image_path)
# OpenCV loads images in BGR order; convert to RGB before handing the
# array to PIL, otherwise the channels are swapped.
image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)
Why do I get no output with the above code?
>>> print(pytesseract.get_languages(config=''))
['chi_sim', 'eng', 'osd']
That image, as it stands, is simply too poor for Tesseract to make out clear characters. It would need to be rectified, its contrast improved, and colour thresholding applied to remove the background noise.
This image shows how some of those defects might be corrected. However, what is left is still simply below par for ordinary OCR.
So why can some systems see that image and generate good text, like this:
中华人民共和国
居民身份证
签发机关
有效期限
2007.05.14-2027.05 14
The answer is aggregation: such systems combine many similar images, so the averaged result they "see" is above par.
Even if you clean an image this well, Tesseract will still not come close to such an artificially improved interpretation.
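The benefit of aggregating many similar images can be shown numerically: averaging N independently noisy captures of the same scene reduces the noise by roughly a factor of sqrt(N). A small NumPy sketch (the constant-grey `clean` array stands in for the true pixel values of the document):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the true, noise-free document intensities.
clean = np.full((50, 50), 128.0)

# 25 noisy "captures" of the same scene (std-dev 40 sensor noise).
frames = [clean + rng.normal(0, 40, clean.shape) for _ in range(25)]

single_err = np.abs(frames[0] - clean).mean()
avg_err = np.abs(np.mean(frames, axis=0) - clean).mean()

# Averaging 25 frames should cut the mean error by roughly sqrt(25) = 5x.
print(f"single frame error: {single_err:.1f}")
print(f"averaged error:     {avg_err:.1f}")
```

This is why a system that can pool many shots of the same ID card reads it cleanly while single-image OCR cannot.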