Tags: python, tensorflow, speech-recognition, mfcc, hmmlearn

How to train an HMM with an audio sentence dataset for speech recognition?


I have read some journal articles and papers on HMMs and MFCCs, but I am still confused about how it works step by step with my dataset (a dataset of audio sentences).

My dataset example (audio form):

All I know:

  1. My sentence dataset is used to get the transition probabilities.
  2. The HMM states are the phonemes.
  3. 39 MFCC features are used to train the HMM models (see the feature-extraction sketch after this list).
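
A minimal sketch of building those 39-dimensional vectors (13 MFCCs plus their deltas and delta-deltas) with python_speech_features; "sentence.wav" is a placeholder path, not part of the original question:

    import numpy as np
    from scipy.io import wavfile
    from python_speech_features import mfcc, delta

    # Assumes a mono WAV file; "sentence.wav" is a placeholder path.
    rate, signal = wavfile.read("sentence.wav")

    # 13 cepstral coefficients per frame (library defaults: 25 ms windows,
    # 10 ms step).
    mfcc_feat = mfcc(signal, samplerate=rate, numcep=13)

    # First- and second-order derivatives over a +/-2 frame window.
    d1 = delta(mfcc_feat, 2)
    d2 = delta(d1, 2)

    features = np.hstack([mfcc_feat, d1, d2])  # shape: (num_frames, 39)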

My questions:

  1. Do I need to cut my sentences into words, or can I just use whole sentences to train the HMM models?
  2. Do I need a phoneme dataset for training? If yes, do I need to train it with an HMM too? If not, how does my program recognize the phonemes for the HMM's predict input?
  3. What steps must I do first?

Note: I'm working in Python, and I use hmmlearn and python_speech_features as my libraries.


Solution

    1. Do I need to cut my sentences into words, or can I just use whole sentences to train the HMM models?

    Theoretically you just need sentences and phonemes, but having isolated words may be useful for your model, since it increases the size of your training data.

    2. Do I need a phoneme dataset for training? If yes, do I need to train it with an HMM too? If not, how does my program recognize the phonemes for the HMM's predict input?

    You need phonemes; otherwise it will be too hard for your model to find the right phoneme segmentation, since it has no examples of isolated phonemes. You should first train your HMM states on the isolated phonemes and then add the rest of the data. If you have enough data, your model may be able to learn without the isolated phoneme examples, but I wouldn't bet on it.
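
    As a sketch of the recognition step: once you have one trained hmmlearn model per phoneme (see the training sketch under question 3 below), recognizing a phoneme from a feature segment amounts to picking the model with the highest log-likelihood. The `models` dict and the helper name here are assumptions for illustration, not part of the original answer.

        # Hypothetical helper: `models` maps phoneme labels to trained
        # hmmlearn models; `segment` is a (num_frames, 39) feature array.
        def recognize_phoneme(models, segment):
            # score() returns the log-likelihood of the observations
            return max(models, key=lambda p: models[p].score(segment))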

    3. What steps must I do first?

    Build your phoneme examples and use them to train a simple HMM model in which you don't model the transitions between phonemes. Once your hidden states carry some information about the phonemes, you can continue the training on isolated words and sentences.
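
    A minimal training sketch with hmmlearn, assuming `phoneme_examples` maps each phoneme label to a list of (num_frames, 39) MFCC arrays cut from isolated-phoneme recordings; three states per phoneme and 20 EM iterations are illustrative choices, not something the answer prescribes:

        import numpy as np
        from hmmlearn import hmm

        def train_phoneme_models(phoneme_examples, n_states=3):
            """Fit one GaussianHMM per phoneme on its isolated examples."""
            models = {}
            for phoneme, examples in phoneme_examples.items():
                X = np.vstack(examples)               # all frames, stacked
                lengths = [len(e) for e in examples]  # frames per example
                model = hmm.GaussianHMM(n_components=n_states,
                                        covariance_type="diag",
                                        n_iter=20)
                model.fit(X, lengths)                 # Baum-Welch (EM) fit
                models[phoneme] = model
            return models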