I have read some journal articles and papers on HMMs and MFCCs, but I am still confused about how they work step by step with my dataset (audio recordings of sentences).
Example of my dataset (audio form):
What I know so far:
My questions:
Note: I'm working in Python and using hmmlearn and python_speech_features as my libraries.
In theory, you only need sentences and phonemes. Isolated words may still be useful, though, because they increase the size of your training data.
You need phonemes: without any examples of isolated phonemes, it will be too hard for your model to find the right phoneme segmentation. You should first train your HMM states on the isolated phonemes and then add the rest of the data. If you have enough data, your model may be able to learn without the isolated phoneme examples, but I wouldn't bet on it.
Build your phoneme examples and use them to train a simple HMM in which you don't model the transitions between phonemes. Once your hidden states carry some information about the phonemes, you can continue training on isolated words and sentences.