python python-tesseract

Pytesseract / Recognizing chars + digits + spaces


I would like to recognize some text (with digits and spaces) from an image using the following code:

erg = pytesseract.image_to_string(img)

Generally this works fine, but I also get characters I don't want, like Ô:

ÔAU OPTRONICS CORPORATION

() Preliminary Specification
(V) Final Specification
Module 18.5" Color TFT-LCD
Model Name (G18SHANOT.O
Customer Date ÔApproved by Date
Crystal Hsieh 2016/06/29
Approved by Propared by

So I tried to restrict Tesseract to a whitelist of characters, using the following code instead:

workString = f'-c tessedit_char_whitelist={string.digits}(){string.ascii_letters}'
erg = pytesseract.image_to_string(img, config=workString)

With that I get the following text. The Ô is no longer output, but unfortunately I no longer get the spaces either:

AUOPTRONICSCORPORATION

()ProliminarySpecification
(V)FinalSpecification
Module 185ColorTFTLCD
ModelName (G18SHANOTO
Customer Date Approvedby Date
CrstalHsieh 2016(06)29
Approvedby Proparedby

Is there any way to whitelist the characters and digits but also still output the spaces / blanks?


Solution

  • config = "-c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.#-:/ '"
    

    Try this. I had a similar issue, and adding a space inside the inner quotes worked for me (the space is the last character in the string). Feel free to add or remove any characters you want Tesseract to include or exclude.
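
    If you would rather not type the whitelist by hand, here is a minimal sketch that builds the same kind of config string from the `string` constants used in the question. The helper name `build_whitelist_config` is made up for illustration. It assumes that pytesseract tokenizes the config string with shell-like rules, which, as far as I know, is why the inner quotes are needed for the space to survive. `shlex.split` is used here only to show how such a string is tokenized.

```python
import shlex
import string


def build_whitelist_config(extra_chars: str = "()") -> str:
    """Build a config string whose whitelist ends with a literal space.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    whitelist = string.ascii_letters + string.digits + extra_chars + " "
    # shlex.quote wraps the option in single quotes, so the trailing
    # space is kept when the string is split into arguments.
    return "-c " + shlex.quote(f"tessedit_char_whitelist={whitelist}")


config = build_whitelist_config()

# Show the tokens a shell-like parser produces from the config string.
args = shlex.split(config)
print(args)

# Assumed usage, with img loaded as in the question:
# erg = pytesseract.image_to_string(img, config=config)
```

    The second token should end with the space. Without the quotes, the trailing space would be treated as a separator and dropped from the whitelist.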