python, tensorflow, speech-recognition, farsi, mozilla-deepspeech

DeepSpeech fails to learn Persian


I’m training DeepSpeech from scratch (without a checkpoint), with a language model generated using KenLM as described in its documentation. The dataset is the Common Voice dataset for Persian.

My configuration is as follows (a sketch of the training command appears after the list):

  1. Batch size = 2 (due to cuda OOM)
  2. Learning rate = 0.0001
  3. Num. neurons = 2048
  4. Num. epochs = 50
  5. Train set size = 7500
  6. Test and Dev sets size = 5000
  7. dropout for layers 1 to 5 = 0.2 (0.4 was also tried, with the same results)
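
Roughly, the training command looks like this (flag names are from the DeepSpeech 0.9.x releases; the CSV, alphabet, and scorer paths are placeholders for my setup):

```bash
# Rough equivalent of the settings listed above; paths are placeholders.
python DeepSpeech.py \
  --train_files fa/train.csv \
  --dev_files fa/dev.csv \
  --test_files fa/test.csv \
  --alphabet_config_path data/alphabet_fa.txt \
  --scorer_path kenlm_fa.scorer \
  --train_batch_size 2 \
  --dev_batch_size 2 \
  --test_batch_size 2 \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --epochs 50 \
  --dropout_rate 0.2 \
  --checkpoint_dir checkpoints_fa/
```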

Train and validation losses decrease during training, but after a few epochs the validation loss stops decreasing. Train loss is about 18 and validation loss is about 40.

The predictions are all empty strings at the end of the process. Any ideas how to improve the model?


Solution

  • The Persian dataset in Common Voice has around 280 hours of validated audio, so this should be enough to create a model that has better accuracy than you're reporting.

    What would help here is knowing the CER and WER figures for the model. Seeing these would indicate whether the best course of action lies with the hyperparameters of the acoustic model or with the KenLM language model. The difference is explained here in the testing section of the DeepSpeech PlayBook.
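
    DeepSpeech prints WER and CER (along with the best and worst transcriptions) at the end of a test epoch, so re-running just the test stage against your checkpoints should give those figures. A rough sketch, assuming 0.9.x flag names and placeholder paths:

    ```bash
    # Hypothetical paths -- point these at your own test CSV, alphabet, scorer and checkpoints.
    python DeepSpeech.py \
      --test_files fa/test.csv \
      --test_batch_size 2 \
      --alphabet_config_path data/alphabet_fa.txt \
      --scorer_path kenlm_fa.scorer \
      --checkpoint_dir checkpoints_fa/ \
      --report_count 10
    ```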

    It is also likely you would need to perform transfer learning on the Persian dataset. I am assuming that the Persian dataset is written in Alefbā-ye Fārsi. This means that you need to drop the alphabet layer in order to learn from the English checkpoints (which use Latin script).
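
    As a rough sketch of what that looks like (again assuming 0.9.x flag names; the checkpoint and data paths are placeholders), the relevant options are along these lines:

    ```bash
    # Hypothetical paths -- deepspeech-0.9.3-checkpoint/ stands in for the released English checkpoints.
    python DeepSpeech.py \
      --drop_source_layers 1 \
      --alphabet_config_path data/alphabet_fa.txt \
      --load_checkpoint_dir deepspeech-0.9.3-checkpoint/ \
      --save_checkpoint_dir checkpoints_fa/ \
      --train_files fa/train.csv \
      --dev_files fa/dev.csv \
      --test_files fa/test.csv
    ```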

    More information on how to perform transfer learning is in the DeepSpeech documentation, but essentially, you need to do two things: