
How to train a machine learning algorithm using MFCC coefficient vectors?


For my final-year project I am trying to identify dog/bark/bird sounds in real time (by recording sound clips). I am using MFCCs as the audio features. So far I have extracted 12 MFCC coefficients in total from a sound clip using the jAudio library. Now I am trying to train a machine learning algorithm (I have not settled on one yet, but it will most probably be an SVM). Each sound clip is around 3 seconds long. I need to clarify two points about this process:

  1. Do I have to train the algorithm on frame-based MFCCs (12 per frame) or on overall clip-based MFCCs (12 per sound clip)?

  2. To train the algorithm, do I have to treat the 12 MFCCs as 12 different attributes, or as one single attribute?

These are the overall MFCCs for the clip:

-9.598802712290967 -21.644963856237265 -7.405551798816725 -11.638107212413201 -19.441831623156144 -2.780967392843105 -0.5792847321137902 -13.14237288849559 -4.920408873192934 -2.7111507999281925 -7.336670942457227 2.4687330348335212

Any help with these problems will be really appreciated. I couldn't find any good help on Google. :)


Solution

    1. You should calculate MFCCs per frame. Since your signal varies in time, taking them over the whole clip would not make sense. Worse, you might end up with dog and bird sounds having similar representations. I'd experiment with several frame lengths. In general, they will be on the order of tens of milliseconds.

    2. All of them should be separate features. Let the machine learning algorithm decide which ones are the best predictors.
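A minimal sketch of points 1 and 2, assuming the clip is already loaded as a mono list of samples. The 25 ms frame length, the 10 ms hop, and the helper names `frame_signal` and `build_training_rows` are illustrative assumptions, not part of jAudio or Yaafe. The MFCC computation itself is left to whichever library you use; the sketch only shows how the frames are cut and how each frame's 12 coefficients become 12 separate feature columns.

```python
from typing import List, Sequence, Tuple


def frame_signal(samples: Sequence[float], sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> List[List[float]]:
    """Split a mono signal into overlapping fixed-length frames.

    frame_ms and hop_ms are assumed starting values; try several.
    A trailing partial frame is dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be at least one sample")

    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


def build_training_rows(mfcc_per_frame: Sequence[Sequence[float]],
                        label: str,
                        n_coeffs: int = 12) -> List[Tuple[List[float], str]]:
    """Turn per-frame MFCC vectors into (features, label) training rows.

    Each frame is one training example; each of its n_coeffs
    coefficients is its own feature column. Every frame inherits
    the label of the clip it came from.
    """
    rows = []
    for coeffs in mfcc_per_frame:
        if len(coeffs) != n_coeffs:
            raise ValueError("expected %d coefficients per frame" % n_coeffs)
        rows.append((list(coeffs), label))
    return rows
```

With these assumed settings, a 3-second clip sampled at 16 kHz gives a few hundred frames, so one clip contributes a few hundred 12-column rows to the training set rather than a single row.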

    Mind that MFCCs are sensitive to noise, so first check how your samples sound. A far richer selection of audio features is offered by, e.g., the Yaafe library, and many of them will serve better in your case. Which specifically? Here's what I found most useful in classification of bird calls:

    You might also find it interesting to check out this project, especially the part where I am interfacing with Yaafe.

    Back in the day I used SVMs, exactly as you are planning. Today I would definitely go with gradient boosting.