python  speech-recognition  speech-to-text  mfcc  dtw

Speech Recognition with MFCC and DTW


So, basically I have tons of data in a word-based dataset, and each recording has a different duration.

This is my approach:

  1. Label the given dataset.
  2. Split the data with Stratified KFold into training data (80%) and testing data (20%).
  3. Extract amplitude, frequency, and time features using MFCC.
  4. Because the time dimension of each MFCC feature matrix is different, I want to use DTW to make the time dimension of every sample exactly the same length.
  5. Then I will train a neural network on the DTW-aligned data.

My questions are:

  1. Is my approach, especially the 4th step, correct?
  2. If my approach is correct, how can I convert each audio clip to the same length with DTW? DTW only lets me compare two MFCC sequences, and the resulting length changes depending on which reference audio I align against.

Solution

  • Ad 1) Labelling

    I am not sure what you mean by "labelling" the dataset. Nowadays, all you need for ASR is an utterance and the corresponding text (search e.g. for CommonVoice to get some data). The details depend on the model you're using, but neural networks do not require any segmentation or additional labeling for this task.

    Ad 2) KFold cross-validation

    Doing cross-validation never hurts. If you have the time and resources to test your model, go ahead and use it. In my case, I just make the test set large enough to get a representative word error rate (WER), mostly because training a model k times is quite an effort: ASR models usually take some time to train. Datasets such as Librispeech (and others) already come with a train/test/dev split, so if you want, you can compare your results with academic ones. Bear in mind when comparing, though, that academic results can be hard to match if they were produced with far more computational power (and data) than you have.
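    For the splitting itself, here is a minimal sketch with scikit-learn's StratifiedKFold. The names (files, labels) and the toy data are hypothetical, not from the question; n_splits=5 reproduces the 80/20 split per fold that the question mentions:

    ```python
    from sklearn.model_selection import StratifiedKFold

    # Toy stand-ins (hypothetical names): audio paths plus the spoken
    # word label of each recording.
    files = [f"data/yes_{i:03d}.wav" for i in range(10)] + \
            [f"data/no_{i:03d}.wav" for i in range(10)]
    labels = ["yes"] * 10 + ["no"] * 10

    # n_splits=5 yields an 80/20 train/test split in every fold.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(files, labels)):
        train_files = [files[i] for i in train_idx]
        test_files = [files[i] for i in test_idx]
        # Train here, evaluate the WER on test_files; averaging the
        # per-fold WERs gives a more robust estimate than a single split.
        print(f"fold {fold}: {len(train_files)} train / {len(test_files)} test")
    ```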

    Ad 3) MFCC Features

    MFCCs work fine, but in my experience and from what I've found in the literature, the log-Mel spectrogram performs slightly better with neural networks. It's not a lot of work to test both, so you might want to try log-Mel as well.
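    Both features are cheap to compute, so comparing them is easy. A minimal sketch with librosa; the file name and parameter choices (80 Mel bands, 25 ms window / 10 ms hop) are illustrative assumptions, not prescriptions:

    ```python
    import librosa

    # Hypothetical file; any mono recording works.
    y, sr = librosa.load("utterance.wav", sr=16000)

    # Log-Mel spectrogram; n_fft=400, hop_length=160 correspond to a
    # 25 ms window and 10 ms hop at 16 kHz, a common ASR setup.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=80)
    log_mel = librosa.power_to_db(mel)                  # shape: (80, n_frames)

    # MFCCs for comparison.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    ```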

    Ad 4) and 5) DTW for same length

    If you use a neural network, e.g. a CTC model, a Transducer, or even a Transformer, you don't need to do that. The audio inputs are not required to have the same length (see the sketch below). Just one thing to keep in mind: when training your model, make sure your batches do not contain too much padding. You want to use some bucketing like bucket_by_sequence_length().
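    To make the variable-length point concrete, here is a minimal sketch (made-up shapes, TensorFlow assumed as the framework) of a CTC loss over a padded batch; each utterance carries its own true length, so no DTW alignment is needed:

    ```python
    import tensorflow as tf

    # Fake model output for 4 utterances of different true lengths,
    # padded to a common 600 frames; vocab of 30 classes, class 0 = blank.
    batch, max_frames, vocab = 4, 600, 30
    logits = tf.random.normal([batch, max_frames, vocab])
    logit_length = tf.constant([600, 512, 430, 380])   # real frame counts

    # Dense label batch, padded with 0; label_length masks the padding.
    labels = tf.ragged.constant(
        [[3, 7, 2], [5, 1], [9, 9, 4, 2], [8]]).to_tensor()
    label_length = tf.constant([3, 2, 4, 1])

    loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                          logits_time_major=False, blank_index=0)
    print(loss.shape)  # one loss value per utterance: (4,)
    ```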

    Just define the batch size as a "number of spectrogram frames" and then use bucketing in order to really make use of the memory you have available. This can make a huge difference for the quality of the model. I learned that the hard way.
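    A sketch of such bucketing with tf.data (bucket_by_sequence_length is a Dataset method since TF 2.6); the bucket boundaries, batch sizes, and the 80-bin feature dimension are illustrative assumptions you should tune to your memory budget:

    ```python
    import tensorflow as tf

    def toy_example(i):
        # Variable-length fake "spectrogram" [time, 80] plus a dummy
        # label sequence, standing in for real (features, labels) pairs.
        frames = tf.cast(100 + 37 * (i % 13), tf.int32)
        return tf.random.normal([frames, 80]), tf.ones([5], dtype=tf.int32)

    ds = tf.data.Dataset.range(100).map(toy_example)
    ds = ds.bucket_by_sequence_length(
        element_length_func=lambda feats, labels: tf.shape(feats)[0],
        bucket_boundaries=[200, 400, 800],    # frames per utterance
        bucket_batch_sizes=[32, 16, 8, 4],    # len(boundaries) + 1 entries
        padded_shapes=([None, 80], [None]),   # pad the time axis only
    )
    # Short utterances land in large batches, long ones in small batches,
    # so every batch uses roughly the same memory with little padding.
    ```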

    Note

    You did not specify your use case, so I'll just mention the following: you need to know what you want to do with your model. If the model is supposed to consume an audio stream such that a user can talk for an arbitrarily long time, you need to know that and work towards it from the beginning.

    Another approach would be: "I only need to transcribe short audio segments", e.g. 10 to 60 seconds or so. In that case you can simply train any Transformer and you'll get pretty good results thanks to its attention mechanism. I recommend going that route if that's all you need, because it is considerably easier. But stay away from it if you need to stream audio content for much longer than that.

    Things get a lot more complicated when it comes to streaming. Any purely encoder-decoder attention-based model is going to require a lot of effort to make streaming work. You can use RNNs (e.g. RNN-T), but these models can become incredibly huge and slow, and they will require additional effort to make them reliable (e.g. a language model, beam search) because they lack the encoder-decoder attention. There are other flavors that combine Transformers with Transducers, but if you want to write all this on your own, alone, you're taking on quite a task.

    See also

    There's already a lot of code out there that you can learn from.

    hth