javaaudiofeature-extractionmfcctarsosdsp

How to Merge MFCCs


I am working on extracting MFCC features from some audio files. The program I have currently extracts a series of MFCCs for each file and has a parameter of a buffer size of 1024. I saw the following in a paper:

The feature vectors extracted within a second of audio data are combined by computing the mean and the variance of each feature vector element (merging).

My current code uses TarsosDSP to extract the MFCCs, but I'm not sure how to split the data into "a second of audio data" in order to merge the MFCCs.

My MFCC extraction code

int sampleRate = 44100;
int bufferSize = 1024;
int bufferOverlap = 512;
inStream = new FileInputStream(path);
AudioDispatcher dispatcher = new AudioDispatcher(new UniversalAudioInputStream(inStream, new TarsosDSPAudioFormat(sampleRate, 16, 1, true, true)), bufferSize, bufferOverlap);
final MFCC mfcc = new MFCC(bufferSize, sampleRate, 13, 40, 300, 3000);
dispatcher.addAudioProcessor(mfcc);
dispatcher.addAudioProcessor(new AudioProcessor() {
    @Override
    public void processingFinished() {
        System.out.println("DONE");
    }
    @Override
    public boolean process(AudioEvent audioEvent) {
        return true;  // breakpoint here reveals MFCC data
    }
});
dispatcher.run();

What exactly is buffer size and could it be used to segment the audio into windows of 1 second? Is there a method to divide the series of MFCCs into certain amounts of time?

Any help would be greatly appreciated.


Solution

  • After more research, I came across this website that clearly showed steps in using MFCCs for Weka. It showed some data files with various statistics each listed as separate attributes in Weka. I believe when the paper said

    computing the mean and variance

    they meant the mean and variance of each MFCC coefficient were used as attributes in the combined data file. When I followed the example on the website to merge the MFCCs, I used max, min, range, max position, min position, mean, standard deviation, skewness, kurtosis, quartile, and interquartile range.

    To split the audio input into seconds, I believe sets are of MFCCs are extracted at the sample rate inputted as the parameter, so if I set it to 100, I would wait for 100 cycles to merge the MFCCs. Please correct me if I'm wrong.