I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate on the images I was trying to extract text from. I came across Tesseract training, which is supposed to reduce the error rate for a specific font. I also found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. From it I received a .traineddata file, and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
This might be a late response, but I'm answering because this question shows up on Google.
Newer versions of Tesseract come shipped with a bunch of tools to make this really easy, without having to do manual work with a box editor.
text2image lets you generate both the .tif file and its respective .box file for use with tesstrain.
text2image \
--font="Font Name" \
--fonts_dir="Optional Fonts Dir" \
--text=path/to/textfile \
--outputbase=path/to/output \
--max_pages=1 \
--leading=32 \
--xsize=3600 \
--ysize=480 \
--char_spacing=1.0 \
--exposure=0 \
--unicharset_file=path/to/unicharset
I believe the --unicharset_file parameter may be optional.
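If you need pages for more than one font, a small wrapper can build one text2image command per font. This is only a rough sketch: the font names, training_text.txt, and the out directory are placeholders I made up, and build_cmd is a hypothetical helper, not part of Tesseract. It only prints the commands, so you can check them before running anything.

```shell
#!/bin/sh
# Sketch: print one text2image command per font (dry run, nothing is executed).
# Font names, text file and output directory are hypothetical placeholders.

build_cmd() {
    # $1 = font name, $2 = training text file, $3 = output directory
    # Spaces in the font name become underscores in the output base name.
    base="$3/$(printf '%s' "$1" | tr ' ' '_')"
    printf 'text2image --font="%s" --text="%s" --outputbase="%s" --max_pages=1\n' \
        "$1" "$2" "$base"
}

for font in "Arial" "Times New Roman"; do
    build_cmd "$font" "training_text.txt" "out"
done
```

Once the printed commands look right, you can pipe the script's output to sh to actually run them. The font name has to match what text2image sees on your system; as far as I know, running text2image with --list_available_fonts (plus --fonts_dir if needed) shows the exact names.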