I'm using pytesseract to read tabular data out of an image, but I'm having trouble with the software making "educated guesses" about characters and word splitting based on context.
I have a specific example I'd like to solve. If I whitelist the `$` character, then the word splitting gives me this for one line of text:

    ['Total', '$8,644.27', '$9,653.70']

But if I blacklist the `$` character and make no other changes, I get this unwanted split near the first comma (and the comma itself is missing):

    ['Total', '8', '644.27', '9,653.70']
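To make the comparison concrete, here is a hedged sketch of the two runs (the config variable names and the `img` handle are my own assumptions; `tessedit_char_blacklist` is the standard counterpart of the whitelist option):

```python
import string

# Whitelist run: $ is an allowed character.
WHITELIST = string.digits + string.ascii_letters + "$.,"
WHITELIST_CFG = f'--oem 3 --psm 6 -c tessedit_char_whitelist="{WHITELIST}"'

# Blacklist run: $ is forbidden, everything else at defaults.
BLACKLIST_CFG = '--oem 3 --psm 6 -c tessedit_char_blacklist="$"'

# With pytesseract (not run here); these are the outputs described above:
#   pytesseract.image_to_string(img, config=WHITELIST_CFG).split()
#       -> ['Total', '$8,644.27', '$9,653.70']
#   pytesseract.image_to_string(img, config=BLACKLIST_CFG).split()
#       -> ['Total', '8', '644.27', '9,653.70']
```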
I could just remove the `$` after tesseract runs, but unfortunately I deliberately excluded `$` because when it's whitelisted, tesseract will often turn strings like `S1` into `$1`, which is a related, equally annoying change.
It will also get a number wrong sometimes if there is a similar number nearby.
It seems tesseract is trying to be clever under the hood, making LLM-style guesses. The thing is, I have a very high-definition source image, so I'd rather tesseract focus less on recognizing words/context and more on identifying characters purely by their outlines.
The current options I have are:

```python
VALID_CHARS = string.digits + string.ascii_letters + '$.,<>\\/#%()*@&: +-'
CUSTOM_TESSERACT_CONFIG = (
    '--oem 3 --psm 6 '
    f'-c tessedit_char_whitelist="{VALID_CHARS}" '
    '-c tessedit_enable_dict_correction=0 '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
    '-c load_punc_dawg=0 '
    '-c load_number_dawg=0 '
    '-c load_unambig_dawg=0 '
    '-c load_bigram_dawg=0 '
    '-c load_fixed_length_dawgs=0 '
    '-c wordrec_enable_assoc=0 '
    '-c language_model_penalty_non_freq_dict_word=0 '
    '-c language_model_penalty_non_dict_word=0 '
    '-c tessedit_prefer_joined_punct=1 '
    '-c textord_enable_word_ngrams=0 '
    '-c tessedit_good_quality_unrej=1 '
    '-c tessedit_enable_bigram_correction=0 '
    '-c tessedit_enable_doc_dict=0 '
    '-c textord_enable_out_of_punct=0 '
    '-c textord_enable_xheight_stats=0 '
    '-c enable_noise_removal=0 '
    '-c classify_enable_adaptive_matcher=0 '
    '-c classify_enable_learning=0 '
    '-c tessedit_preserve_blk_rej_perfect_wds=1 '
    '-c preserve_interword_spaces=1 '
    '-c segment_penalty_dict_case=0 '
    '-c segment_penalty_garbage=0 '
    '-c textord_split_num_pattern=0'
)
```
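For reference, this is how a config string like the one above gets passed to pytesseract (a self-contained sketch; `invoice.png` is a hypothetical input file):

```python
import string

# Rebuilt here so the sketch is self-contained; same whitelist as above.
VALID_CHARS = string.digits + string.ascii_letters + '$.,<>\\/#%()*@&: +-'
CONFIG = f'--oem 3 --psm 6 -c tessedit_char_whitelist="{VALID_CHARS}"'

if __name__ == "__main__":
    import pytesseract        # third-party: pip install pytesseract
    from PIL import Image     # third-party: pip install pillow

    # pytesseract shlex-splits the config string, so the quoted
    # whitelist value survives as a single argument.
    text = pytesseract.image_to_string(Image.open("invoice.png"), config=CONFIG)
    print(text.split())
```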
I'm not even sure whether these options are doing anything, or whether I need to retrain a model.
I need the character or word boundaries, not just the text, because I have to group words/chars based on their bounding boxes.
I'm really only interested in character recognition (Latin alphanumerics and punctuation) and splitting on whitespace only, as long as I get the x, y, w, h coords of each word. I don't want tesseract to change characters based on surrounding characters, punctuation, number formats, frequencies, dictionaries, or whatever else it's doing under the hood.
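Word boxes come from `image_to_data` rather than `image_to_string`. A minimal sketch, assuming pytesseract and Pillow are installed and `invoice.png` stands in for the real image:

```python
def word_boxes(data):
    """Pair each non-empty word from image_to_data output with its (x, y, w, h) box."""
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # tesseract emits empty rows for page/block/line levels
            words.append(
                (text, (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]))
            )
    return words

if __name__ == "__main__":
    import pytesseract        # third-party: pip install pytesseract
    from PIL import Image     # third-party: pip install pillow

    data = pytesseract.image_to_data(
        Image.open("invoice.png"),   # hypothetical input
        config="--oem 3 --psm 6",
        output_type=pytesseract.Output.DICT,
    )
    for text, (x, y, w, h) in word_boxes(data):
        print(text, x, y, w, h)
```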
Use the legacy Tesseract engine (`--oem 0`) for character-level OCR. The legacy engine classifies characters by shape and makes fewer "smart" language-model guesses than the LSTM engine (`--oem 1`), which is what the default `--oem 3` typically selects in recent builds.
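A sketch of a legacy-engine config (assuming your tessdata directory has the legacy `eng.traineddata` from the main tessdata repo; the `tessdata_fast`/`tessdata_best` files are LSTM-only and will error under `--oem 0`):

```python
LEGACY_CONFIG = (
    '--oem 0 --psm 6 '                       # 0 = legacy engine only
    '-c tessedit_char_whitelist="0123456789$.,"'
)
# With pytesseract (not run here):
#   pytesseract.image_to_data(img, config=LEGACY_CONFIG,
#                             output_type=pytesseract.Output.DICT)
```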