Tags: machine-learning, fft, waveform, phonetics

Waveform Comparison


I am working on a personal research project.

My objective is to be able to recognize a sound and identify whether or not it belongs to the IPA by comparing its waveform to a waveform in my database. I have some skill with Mathematica, SciPy, and PyBrain.

For the first phase, I'm only using the English (US) phonetic alphabet. I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:

I want to separate a sound file into waveforms that correspond to different syllables; this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that make up the sentence.
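
A crude, non-learning baseline for that segmentation step is to split on pauses, i.e. on stretches of low short-time energy. The sketch below does that with SciPy; the file name "phrase.wav", the 20 ms window, and the 10% energy threshold are illustrative assumptions, and a real syllable segmenter would still need the learning step described above.

    import numpy as np
    from scipy.io import wavfile

    # Rough, non-learning baseline: split a recording wherever its
    # short-time energy stays low (pauses).  Window size and threshold
    # are illustrative guesses, not tuned values.
    rate, signal = wavfile.read("phrase.wav")        # mono WAV assumed
    signal = signal.astype(np.float64)
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal /= peak

    win = int(0.02 * rate)                           # 20 ms analysis window
    energy = np.array([np.sum(signal[i:i + win] ** 2)
                       for i in range(0, len(signal) - win, win)])
    voiced = energy > 0.1 * energy.max()             # crude energy threshold

    chunks, start = [], None
    for idx, v in enumerate(voiced):
        if v and start is None:
            start = idx * win                        # chunk begins
        elif not v and start is not None:
            chunks.append(signal[start:idx * win])   # chunk ends at a pause
            start = None
    if start is not None:
        chunks.append(signal[start:])
    print("found", len(chunks), "candidate chunks")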

Each waveform is then compared against the English phonetic alphabet's waveforms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture an image of each waveform, and compare it to the one stored in the database with image analysis (which would be kind of fun to do).

The problem is that I don't know how to make Praat generate a waveform file automatically and then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the computer to do it.

Instead of needing a waveform image, could I do this with a fast Fourier transform and compare two FFTs: if they match within an x% margin of error, treat the sound as syllable y?
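
To make that idea concrete, here is a minimal sketch of the proposed comparison with SciPy: take the magnitude spectrum of each clip, bring both to a common length, and accept a match when the difference is under a chosen tolerance. The file names, the 2048-point resampling, and the 10% tolerance are illustrative assumptions.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import resample

    def magnitude_spectrum(path, n_points=2048):
        """Normalized magnitude spectrum, resampled to a common length."""
        rate, signal = wavfile.read(path)            # mono WAV assumed
        spectrum = np.abs(np.fft.rfft(signal.astype(np.float64)))
        spectrum = resample(spectrum, n_points)      # same length for both clips
        return spectrum / (np.linalg.norm(spectrum) + 1e-12)

    def is_same_syllable(path_a, path_b, tolerance=0.10):
        a, b = magnitude_spectrum(path_a), magnitude_spectrum(path_b)
        return np.linalg.norm(a - b) < tolerance     # both are unit vectors

    print(is_same_syllable("unknown_chunk.wav", "reference_syllable.wav"))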


Solution

  • You could try Praat scripting.

    Using just the FFT will give you rather poor results: a very long feature vector that is difficult to segment and train on, thousands of points for a single syllable. Some deep neural networks can cope with that, but only if you design them properly and provide a huge training set. The advantage of neural networks is that they can build features for you from the "raw data" (and I would consider the FFT "raw" as well). When you work with sound, however, that is not really needed: you can engineer features manually. In the case of sound, it is well understood what sort of "features" are useful.

    You can calculate these features with libraries like Yaafe. I recommend checking it out even if you are not working in C++ or Python; the link I provided also gives the formulas for calculating them. I used some of them in my kiwi classifier.

    Another good approach comes from scikits.talkbox, which provides exactly the tooling you might need; a rough sketch of using it follows below.
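
    As an illustration of the feature route, the sketch below extracts MFCCs (a standard compact speech feature) for two chunks and compares their averaged coefficient vectors. It assumes scikits.talkbox's mfcc helper with its default parameters; the file names, the frame averaging, and the plain Euclidean distance are illustrative choices, not a prescribed recipe.

        import numpy as np
        from scipy.io import wavfile
        from scikits.talkbox.features import mfcc   # assumed helper from scikits.talkbox

        def mean_mfcc(path):
            rate, signal = wavfile.read(path)        # mono WAV assumed
            ceps, _, _ = mfcc(signal, fs=rate)       # per-frame cepstral coefficients
            return ceps.mean(axis=0)                 # one compact vector per chunk

        unknown = mean_mfcc("unknown_chunk.wav")
        reference = mean_mfcc("reference_syllable.wav")
        print("distance:", np.linalg.norm(unknown - reference))

    Vectors like these, a handful of coefficients per frame instead of thousands of FFT bins, are the kind of input you would feed to PyBrain or any other classifier.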