We are working on a speech-to-text project. We are quite new to this field and would be very grateful if you could help us.
Our goal is to use MFCCs to extract features from the audio dataset, use a CNN model to estimate the likelihood of each feature, and then use an HMM model to convert the audio data to text. All of these steps are clear to us except for the labeling. When we preprocessed the data, we divided the audio into smaller time frames, each frame about 45 ms long, with a 10 ms shift between consecutive frames.
I am going to use the TIMIT dataset, but I am completely confused about its labeling. I checked the dataset and found that the label files have 3 columns. The first is BEGIN_SAMPLE :== the beginning integer sample number for the segment, the second is the ending integer sample number for the segment, and the last is PHONETIC_LABEL :== a single phonetic transcription. How do we use this labeling? Are the first and second columns important? Thanks for your time.
The first column is the starting sample number of the phoneme, the second is the ending sample number. TIMIT audio is sampled at 16 kHz, so divide by 16000 to convert samples to seconds.
E.g.
0 3050 h#
3050 4559 sh
h# (silence) spans samples 0 to 3050, i.e. from 0 s to about 0.19 s
sh spans samples 3050 to 4559, i.e. from about 0.19 s to about 0.28 s
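The conversion above can be sketched in a few lines of Python. This is just an illustration of reading `.PHN`-style rows and turning sample indices into seconds; `sample_to_seconds` is a name I made up, and the 16 kHz rate is TIMIT's documented sampling rate.

```python
# TIMIT audio is sampled at 16 kHz, so time = sample / 16000.
SAMPLE_RATE = 16000

def sample_to_seconds(sample: int) -> float:
    """Convert a TIMIT sample index to seconds."""
    return sample / SAMPLE_RATE

# The two segments from the example above:
segments = [(0, 3050, "h#"), (3050, 4559, "sh")]
for begin, end, label in segments:
    print(f"{label}: {sample_to_seconds(begin):.3f}s - {sample_to_seconds(end):.3f}s")
```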
You can use those labels to train a frame-level phoneme classifier, then build the ASR system with an HMM on top. The Kaldi toolkit has a recipe for the TIMIT dataset.
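To get frame-level targets for such a classifier, a common approach is to assign each analysis frame the label of the phoneme segment that contains the frame's centre sample. Here is a minimal sketch under the question's framing setup (45 ms window, 10 ms shift); `parse_phn` and `label_frames` are illustrative names, not TIMIT tooling.

```python
SAMPLE_RATE = 16000
WIN = int(0.045 * SAMPLE_RATE)   # 720 samples per frame
HOP = int(0.010 * SAMPLE_RATE)   # 160 samples between frame starts

def parse_phn(lines):
    """Parse .PHN rows of the form 'BEGIN END LABEL' into tuples."""
    segs = []
    for line in lines:
        begin, end, label = line.split()
        segs.append((int(begin), int(end), label))
    return segs

def label_frames(segs, n_samples):
    """Return one phoneme label per frame, chosen at the frame centre."""
    labels = []
    start = 0
    while start + WIN <= n_samples:
        centre = start + WIN // 2
        for begin, end, label in segs:
            if begin <= centre < end:
                labels.append(label)
                break
        start += HOP
    return labels

segs = parse_phn(["0 3050 h#", "3050 4559 sh"])
frame_labels = label_frames(segs, 4559)
```

Frames whose centre falls inside the h# segment get "h#", the rest get "sh"; these per-frame labels line up one-to-one with the MFCC frames you feed to the CNN.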
Also, an ASR system can be built without these time labels: a GMM-HMM model can recover the timestamps itself (forced alignment), and end-to-end ASR can learn the alignment as well.
In my experience, newcomers who want to quickly build an ASR system are likely to get frustrated, because it is much more complicated than it sounds. So if you want to go deep into the ASR field, you need to spend time on the theory and the skills; otherwise, I think it's better to rely on people who have related experience.
Personal opinion.