I’m training DeepSpeech from scratch (without a checkpoint) with a language model generated using KenLM, as described in its documentation. The dataset is the Common Voice dataset for the Persian language.
My configurations are as follows:
Train and validation losses decrease during training, but after a few epochs the validation loss stops decreasing. Train loss is about 18 and validation loss is about 40.
The predictions are all empty strings at the end of the process. Any ideas on how to improve the model?
The Persian dataset in Common Voice has around 280 hours of validated audio, so this should be enough to create a model that has better accuracy than you're reporting.
What would help here is to know the CER and WER figures for the model. Seeing these indicates whether the best course of action lies with the hyperparameters of the acoustic model or with the KenLM language model. The difference is explained here in the testing section of the DeepSpeech PlayBook.
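To get those figures, you can run a test pass over a held-out set with the DeepSpeech 0.9.x training script; WER and CER are reported at the end of the run. This is only a sketch — the CSV, alphabet, checkpoint and scorer paths below are placeholders for your own files:

```bash
# Evaluate the current checkpoints (acoustic model + scorer) on the test set.
# WER/CER and the best/worst sample transcripts are printed when it finishes.
python DeepSpeech.py \
  --test_files fa/clips/test.csv \
  --alphabet_config_path data/alphabet_fa.txt \
  --checkpoint_dir persian-checkpoints/ \
  --scorer_path kenlm-persian.scorer \
  --test_batch_size 32
```

Running it once with and once without `--scorer_path` is a quick way to see how much of the error comes from the acoustic model versus the language model.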
It is also likely that you would need to perform transfer learning on the Persian dataset rather than training from scratch. I am assuming the Persian dataset is written in Alefbā-ye Fārsi, which means you need to drop the alphabet layer in order to learn from the English checkpoints (which use Latin script).
More information on how to perform transfer learning is in the DeepSpeech documentation, but essentially you need two flags (a combined example follows the list):
- `--drop_source_layers 3` to drop the source layers, allowing transfer learning from another alphabet
- `--load_checkpoint_dir deepspeech-data/deepspeech-0.9.3-checkpoint` to specify where to load the checkpoints from on which to perform transfer learning
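Putting those together, a transfer-learning run might look roughly like this. Treat it as a sketch: the Persian CSVs, alphabet file, scorer and checkpoint directories are placeholders for your own setup, and the epoch count, learning rate and batch sizes are illustrative values, not recommendations:

```bash
# Fine-tune from the English 0.9.3 checkpoints on Persian data,
# re-initialising the last 3 layers so the alphabet can change.
python DeepSpeech.py \
  --drop_source_layers 3 \
  --load_checkpoint_dir deepspeech-data/deepspeech-0.9.3-checkpoint \
  --save_checkpoint_dir persian-checkpoints/ \
  --alphabet_config_path data/alphabet_fa.txt \
  --train_files fa/clips/train.csv \
  --dev_files fa/clips/dev.csv \
  --test_files fa/clips/test.csv \
  --scorer_path kenlm-persian.scorer \
  --epochs 30 \
  --learning_rate 0.0001 \
  --train_batch_size 32 \
  --dev_batch_size 32 \
  --test_batch_size 32
```

The key point is that `--drop_source_layers 3` discards the last layers of the source model (including the output layer tied to the English alphabet), so your Persian alphabet file does not need to match the one the checkpoints were trained with.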