I would like to prepare an audio dataset for a machine learning model.
Each .wav file should be represented as an MFCC image.
While all of the images will have the same number of MFCCs (20), the .wav files vary in length between 3 and 5 seconds.
Should I manipulate all the .wav files to have the same length? Should I normalize the MFCC values (between 0 and 1) prior to plotting?
Are there any important steps to do with such data before passing it to a machine learning model?
Further reading links would also be appreciated.
Most classifiers will require a fixed-size input, yes. You can achieve this by truncating or padding the MFCC matrices after you have calculated them; there is no need to manipulate the WAV/waveform itself.
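A minimal sketch of this, assuming librosa (not named in the question) and an illustrative target width; the frame count of 216 is just what ~5 seconds comes to at librosa's default sample rate and hop length:

```python
import numpy as np
import librosa

N_MFCC = 20          # number of MFCCs per frame, as in the question
TARGET_FRAMES = 216  # hypothetical fixed width: ~5 s at sr=22050, hop=512

def fixed_size_mfcc(path, n_mfcc=N_MFCC, target_frames=TARGET_FRAMES):
    """Compute MFCCs and pad/truncate along the time axis to a fixed width."""
    y, sr = librosa.load(path)  # default sr=22050
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    n_frames = mfcc.shape[1]
    if n_frames < target_frames:
        # zero-pad on the right along the time axis
        mfcc = np.pad(mfcc, ((0, 0), (0, target_frames - n_frames)))
    else:
        # truncate to the first target_frames frames
        mfcc = mfcc[:, :target_frames]
    return mfcc  # shape (n_mfcc, target_frames)
```

Padding with zeros after MFCC computation is simpler than padding the waveform, since you never have to reason about what silence looks like in the cepstral domain across files.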
Another approach is to split your audio files into multiple analysis windows, say 1 second each. A 3-second file is then covered by 3 predictions (or more if the windows overlap), while a 5-second file would take 5 predictions (or more). To get a clip-wide prediction, one merges the predictions over all windows in the clip. The easy way to train in this fashion requires assuming that the label given for the clip is valid for each individual analysis window.
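Here is a sketch of that windowing-and-merging idea. The `model.predict_proba` call is a hypothetical stand-in for whatever classifier you train on single windows, the window/hop lengths are illustrative, and averaging is just one common merging strategy:

```python
import numpy as np
import librosa

WINDOW_S = 1.0  # analysis window length in seconds
HOP_S = 0.5     # hop between window starts (50% overlap)

def predict_clip(path, model, n_mfcc=20):
    """Split a clip into fixed-length windows, predict each, merge by mean."""
    y, sr = librosa.load(path)
    win = int(WINDOW_S * sr)
    hop = int(HOP_S * sr)
    probs = []
    for start in range(0, max(len(y) - win, 0) + 1, hop):
        window = y[start:start + win]
        mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=n_mfcc)
        # hypothetical per-window classifier; [np.newaxis] adds a batch dim
        probs.append(model.predict_proba(mfcc[np.newaxis, ...]))
    # clip-wide prediction: average the per-window class probabilities
    return np.mean(probs, axis=0)
```

At training time the same windowing is applied, with each window simply inheriting its clip's label, which is the assumption mentioned above.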