My goal is to add a "Handwriting" font to the Hebrew language.
I succeeded in creating the .tif and .box files, and then the .tr file.
But creating the traineddata fails. I'm getting this error:
Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr
Notes:
Any help would be appreciated.
My Script :
#
text2image --text="langdata_lstm/heb/heb.training_text" --outputbase="output/2" --font="Handwriting Regular" --D="output" --fonts_dir="fonts" --max_pages="2"
#
tesseract "output/2.tif" "output/2" -l heb box.train.stderr
#
lstmtraining --stop_training --continue_from="output/2.tr" --traineddata="tessdata/heb.traineddata" --model_output="output/2.traineddata"
Output :
Loaded file output/.tr, unpacking...
Failed to read continue from: output/.tr
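The error names output/.tr, while the script passes output/2.tr. One possible reading (an assumption, not something confirmed in this post) is that the base name was empty when the command actually ran. A minimal Python sketch of a guard against that; `output_paths` is a hypothetical helper, not part of Tesseract:

```python
from pathlib import PurePosixPath


def output_paths(outputbase: str) -> dict:
    """Return the .tif/.box/.tr paths that share one output base.

    Raises ValueError when the base has no file-name part, which would
    otherwise produce paths such as 'output/.tr'.
    """
    base = PurePosixPath(outputbase)
    # A trailing slash or an empty string means the file-name part is missing.
    if outputbase.endswith("/") or not base.name:
        raise ValueError(f"empty base name in {outputbase!r}")
    return {ext: f"{base}.{ext}" for ext in ("tif", "box", "tr")}
```

Calling `output_paths("output/2")` gives the three file names the script expects, so a mismatch with what the tools report shows up before lstmtraining is started.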
Installation ::
python-3.12.4-amd64.exe
tesseract-ocr-w64-setup-5.4.0.20240606.exe
tesstrain-windows-GUI-main.zip (https://codeload.github.com/buliasz/tesstrain-windows-gui/zip/refs/heads/main)
AutoHotkey_2.0.18_setup.exe (GUI's dependency) (https://www.autohotkey.com/download/ahk-v2.exe)
Directory structure ::
/app (tesseract-ocr-w64-setup-5.4.0.20240606.exe)
    /gui (tesstrain-windows-gui-main.zip)
    /langdata_lstm (github)
    /tessdata (exists)
    /tessdata_best (github)
/heb_hw
    /data
    /gt
Lastly, on GUI, click on "Re-check requirements".
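The directory layout listed above can be checked before training starts. This is only a sketch: the nesting in `EXPECTED_DIRS` is an assumption taken from the paths used in the steps ('app/langdata_lstm', 'app/tessdata', 'app\tessdata_best', 'heb_hw/gt', 'heb_hw/data'), and `missing_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout, inferred from the paths used in the steps; adjust as needed.
EXPECTED_DIRS = [
    "app/langdata_lstm",
    "app/tessdata",
    "app/tessdata_best",
    "heb_hw/data",
    "heb_hw/gt",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]
```

An empty list from `missing_dirs(".")` means every assumed folder is present.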
Explanation ::
Steps ::
note: it uses 'app/langdata_lstm'.
note: I had to install the TTF fonts in Windows (the app can't just read them from a directory)
python heb_hw/gt.py
set 'tessData folder' to 'app\tessdata_best'
note: the installed variant doesn't allow appending (the 'best' variant does)
set 'Input ground truth dir' to 'heb_hw\gt'
set 'Output dir' to 'heb_hw/data'
set 'New language model name' to 'heb_hw'
set 'Language type' to 'RTL'
note: in this step, it creates per-line files from the original heb model. With them, and the files from step 1, it creates a checkpoint file.
note: at the end, allow copying the Fast model to app/tessdata (for testing)
first, copy the traineddata from "heb_hw\data\heb_hw\traineddata_fast" to "app/tessdata"
tesseract -l heb_hw_fast test/test.jpg "ocr (heb_hw)"
DONE !
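The copy-and-test step above could also be scripted. A rough sketch, assuming the fast model is a file ending in .traineddata inside 'heb_hw\data\heb_hw\traineddata_fast' (the exact file name is not stated in this post); `install_models` and `ocr_command` are hypothetical helpers:

```python
import shutil
from pathlib import Path


def install_models(src_dir, tessdata_dir):
    """Copy every *.traineddata file from src_dir into tessdata_dir.

    Returns the sorted list of copied file names.
    """
    src_dir, tessdata_dir = Path(src_dir), Path(tessdata_dir)
    copied = []
    for model in sorted(src_dir.glob("*.traineddata")):
        shutil.copy(model, tessdata_dir / model.name)
        copied.append(model.name)
    return copied


def ocr_command(image, outputbase, lang):
    """Build the same tesseract call as in the step above, as an argument list."""
    return ["tesseract", "-l", lang, str(image), str(outputbase)]
```

The list returned by `ocr_command("test/test.jpg", "ocr (heb_hw)", "heb_hw_fast")` can be passed to `subprocess.run(..., check=True)`; using a list avoids quoting problems with the space in the output name.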
gt.py ::
import os
import random
import pathlib
import subprocess

langdata = 'app/langdata_lstm'
training_text_file = f'{langdata}/heb/heb.training_text'
unicharset = f'{langdata}/heb.unicharset'
output_directory = 'heb_hw/gt'
count = 10000  # maximum number of text lines to render
lines = []
fonts = ['Gveret Levin AlefAlefAlef Regular','Anka CLM Bold Expanded','Dana Yad AlefAlefAlef Condensed','Gadi Almog AlefAlefAlef Regular','Ktav Yad CLM Medium Italic']

# Open the training text file with UTF-8 encoding
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

# Create the ground-truth directory if it does not exist yet
os.makedirs(output_directory, exist_ok=True)

# Take a random sample of at most `count` lines
random.shuffle(lines)
lines = lines[:count]

file_name_stem = pathlib.Path(training_text_file).stem
line_count = 0

for line in lines:
    # Render the same line once per font
    for font_index, font in enumerate(fonts):
        file_name = f'{file_name_stem}_f{font_index}_{line_count}'
        line_training_text = os.path.join(output_directory, f'{file_name}.gt.txt')

        # Write the single-line ground-truth text file
        with open(line_training_text, 'w', encoding='utf-8') as output_file:
            output_file.writelines([line])

        # Render the line to an image with text2image
        outputbase = f'{output_directory}/{file_name}'
        subprocess.run([
            'text2image',
            f'--font={font}',
            f'--text={line_training_text}',
            f'--outputbase={outputbase}',
            '--max_pages=1',
            '--strip_unrenderable_words',
            f'--unicharset_file={unicharset}'
        ])

    line_count += 1
    print(line_count, ' / ', count)
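Since gt.py does not check the result of each text2image call, a font that fails to render can leave a .gt.txt file without an image. A small sketch for spotting such cases after the run; `unpaired_ground_truth` is a hypothetical helper, and it assumes text2image writes `<outputbase>.tif` next to each `.gt.txt` file:

```python
from pathlib import Path


def unpaired_ground_truth(directory):
    """Return the stems of *.gt.txt files that have no matching .tif image."""
    directory = Path(directory)
    suffix = ".gt.txt"
    missing = []
    for gt_file in sorted(directory.glob(f"*{suffix}")):
        stem = gt_file.name[: -len(suffix)]
        if not (directory / f"{stem}.tif").exists():
            missing.append(stem)
    return missing
```

Running `unpaired_ground_truth("heb_hw/gt")` after gt.py finishes lists the lines that would need to be re-rendered or removed before training.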