I am using Tesseract 5, which is currently in alpha. I am trying to train it from images, without font files. I managed to generate a box file from an image using the following command:
tesseract image.tif imagebox -l ara wordstrbox
After this step, I will fix the OCR errors in the box file. What I need next is to convert the box file and the tif into an .lstmf file.
I cannot find any guidance on how to do this. All I could find is the following passage from the OCR training documentation:
The training data is provided via .lstmf files, which are serialized DocumentData. They contain an image and the corresponding UTF8 text transcription, and can be generated from tif/box file pairs using Tesseract in a similar manner to the way .tr files were created for the old engine.
Please advise on how to convert tif and box into lstmf at this stage.
Thanks,
Found it:
tesseract image.tif training --psm 6 lstm.train
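Note that the box file must have the same base name as the image file. For image.tif the box file has to be image.box, so a box file generated as imagebox.box (as in the question) needs to be renamed first.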
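For several images, the same command can be wrapped in a small loop. This is only a sketch under assumptions: `box_for` and `make_lstmf` are hypothetical helper names (not part of Tesseract), `tesseract` is assumed to be on the PATH, and `--psm 6` is simply copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper: print the box file name expected for an image,
# i.e. the same path with the extension replaced by .box.
box_for() {
    printf '%s.box\n' "${1%.*}"
}

# Hypothetical helper: run the lstm.train step for each image given,
# skipping any image whose matching box file is missing.
make_lstmf() {
    for img in "$@"; do
        box=$(box_for "$img")
        if [ ! -f "$box" ]; then
            echo "skipping $img: $box not found" >&2
            continue
        fi
        # Use the image's base name as the output base, so that
        # image.tif should end up with image.lstmf next to it.
        tesseract "$img" "${img%.*}" --psm 6 lstm.train
    done
}
```

After sourcing the file, it could be called as `make_lstmf *.tif` from the directory that holds the tif/box pairs.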