[SOLVED] How to generate lstmf from .box and .tif files in tesseract 5 alpha lstm training

I am using the current Tesseract 5 alpha and am trying to train from images, without font files. I managed to generate a box file from an image with the following command:

tesseract image.tif imagebox -l ara wordstrbox

After this step, I will fix the errors in the OCR output. I then need to convert the box file and the tif into an .lstmf file.

I cannot find any guidance on how to do this. All I could find is the following passage in the OCR training documentation:

The training data is provided via .lstmf files, which are serialized DocumentData. They contain an image and the corresponding UTF8 text transcription, and can be generated from tif/box file pairs using Tesseract in a similar manner to the way .tr files were created for the old engine.

Please advise on how to convert a tif and box pair into an lstmf file at this stage.

Thanks,

Solution

Found it:

tesseract image.tif training --psm 6 lstm.train

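Note that the box file must have the same base name as the image file (for example, image.tif and image.box), so a box file generated as imagebox.box by the earlier command has to be renamed to image.box first. The second argument is the output base, so this command writes training.lstmf.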
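
For several images, the same command can be wrapped in a small loop. This is only a sketch, not something from the training docs: make_lstmf is a hypothetical helper name, and it assumes every image is a .tif with its corrected box file sitting next to it under the same base name. Images without a matching box file are skipped, and --psm 6 is simply carried over from the command above.

```shell
#!/bin/sh
# Sketch: run lstm.train on every .tif in a directory that has a matching .box.
# make_lstmf is a hypothetical helper name, not part of Tesseract.
make_lstmf() {
    dir=$1
    for img in "$dir"/*.tif; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$img" ] || continue
        base=${img%.tif}
        # The box file must share the image's base name.
        if [ ! -f "$base.box" ]; then
            echo "skipping $img: no $base.box" >&2
            continue
        fi
        # Output base is "$base", so the result is written to "$base.lstmf".
        tesseract "$img" "$base" --psm 6 lstm.train
    done
}
```

Calling make_lstmf . from the directory holding the tif/box pairs should leave one .lstmf next to each pair.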