swift, signal-processing, linear-algebra, cblas

Extending the Mel Spectrogram example from the Apple Developer docs to recorded samples instead of a live microphone


I'm building on top of the Apple Developer documentation example called Computing the Mel Spectrum Using Linear Algebra. My goal is to extend this example so that it can be applied to pre-recorded samples instead of the live microphone input it was written for. Specifically, I use the subroutine from the example as follows:

var i = 0
var samplesInThisWindow: [Float]

while i * hopCnt < samples.count {

    if i * hopCnt + windowSize > samples.count {
        // Last window runs past the end of the signal: zero-pad it.
        samplesInThisWindow = Array(samples[i * hopCnt..<samples.count])
        samplesInThisWindow.append(
            contentsOf: [Float](repeating: 0, count: i * hopCnt + windowSize - samples.count)
        )
    } else {
        samplesInThisWindow = Array(samples[i * hopCnt..<i * hopCnt + windowSize])
    }

    let FFTValues = subroutineFromExample(samplesInThisWindow: &samplesInThisWindow)
    stftSpectrogram.append(contentsOf: FFTValues)
    i += 1
}

return stftSpectrogram

When this routine completes, I get back the STFT of the samples; the Mel spectrogram has yet to be computed. This means the FFT result is a (time_bins x window_size) matrix, where time_bins = (samples_count - window_size)/hop_size + 1.
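As a sanity check of that shape (ignoring any zero-padded tail window for simplicity), the framing can be sketched in NumPy. Python is used here purely for illustration, and the parameter values are arbitrary:

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
window_size = 1024
hop_size = 256
samples = np.zeros(5000, dtype=np.float32)

# One row per hop, keeping only windows that fit entirely inside the
# signal, which matches time_bins = (samples_count - window_size) / hop_size + 1.
time_bins = (len(samples) - window_size) // hop_size + 1
stft_input = np.stack([
    samples[i * hop_size : i * hop_size + window_size]
    for i in range(time_bins)
])  # shape: (time_bins, window_size)
```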

At this point I go back to computing the Mel spectrogram using the code in the example, which produces a Mel filterBank matrix that is (filterbanks_count x window_size). (I'm not entirely sure, but I assume so, since the makeFilterBank method contains the line row = i * window_size.) The code then computes the spectrogram through a matrix multiply:

cblas_sgemm(CblasRowMajor,
            CblasTrans,
            CblasTrans,
            Int32(1),
            Int32(self.filterbanksCount),
            Int32(self.windowSize),
            1,
            fftResultPtr.baseAddress,
            Int32(1),
            filterBank.baseAddress,
            Int32(self.windowSize),
            0,                
            sgemmResult!.baseAddress,
            Int32(self.filterbanksCount)
        )

So, according to the documentation, this computes either C ← αAB + βC or C ← αBA + βC, where A and B can each optionally be transposed.
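The standard BLAS contract for sgemm can be modeled in a few lines of NumPy (again, illustration only, not the Accelerate implementation): C ← α·op(A)·op(B) + β·C, where op(X) is X or Xᵗ depending on the transpose flags.

```python
import numpy as np

def sgemm_model(trans_a, trans_b, alpha, A, B, beta, C):
    """Toy model of the BLAS contract: C <- alpha * op(A) @ op(B) + beta * C."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    return alpha * (op_a @ op_b) + beta * C

A = np.arange(6.0).reshape(2, 3)    # 2x3
B = np.arange(12.0).reshape(3, 4)   # 3x4
C = np.zeros((2, 4))

# No transposes: op(A) @ op(B) is a plain (2x3) @ (3x4) = (2x4) product.
out = sgemm_model(False, False, 1.0, A, B, 0.0, C)
```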

The example code expects the FFT result to be a (1 x window_size) matrix, because it processes a single time bin at a time. In this case, then, FFT is (1 x window_size) and the Mel filterBank is (filterbanks_count x window_size). Since CblasTrans is specified for both input matrices, and neither AᵗBᵗ nor BᵗAᵗ would have compatible dimensions for a product, I'll assume the Mel filterBank is actually stored as (window_size x filterbanks_count), which means sgemmResult = MELᵗ·FFTᵗ and cblas_sgemm is operating in the C ← αBA + βC mode.

This entails that sgemmResult is a (filterbanks_count x 1) matrix.
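The dimension reasoning above can be checked directly in NumPy (toy sizes stand in for window_size and filterbanks_count; Python is used only to illustrate the shapes):

```python
import numpy as np

# Toy sizes standing in for window_size and filterbanks_count.
window_size, filterbanks_count = 8, 3

fft = np.ones((1, window_size))                   # (1 x window_size)
mel = np.ones((window_size, filterbanks_count))   # assumed (window_size x filterbanks_count)

# With both operands transposed, MEL^t @ FFT^t is the only product whose
# inner dimensions line up: (fb x ws) @ (ws x 1) -> (fb x 1).
sgemm_result = mel.T @ fft.T
```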

Now this is a bit of a pain to generalize, because my interface expects to receive the rotated spectrogram as input, that is, a (time_bins x window_size) matrix, while the obvious generalization would be to replace all the hardcoded 1s in the code with FFT.count/window_size (FFT is stored in row-major order).

Doing so produces a (filterbanks_count x time_bins) matrix as output, and as a result the rendered spectrogram looks funny (it looks like multiple frequency bins are combined horizontally to fill the available width, as I'd expect anyway). So my idea was the following: instead of computing MEL(FFT) = MELᵗ·FFTᵗ, I'd compute MEL(FFT)ᵗ = FFT·MEL and get the correct result.
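Under the assumed shapes, that transpose identity, (MELᵗ·FFTᵗ)ᵗ = FFT·MEL, checks out numerically in NumPy (illustrative sizes only):

```python
import numpy as np

time_bins, window_size, filterbanks_count = 5, 8, 3
rng = np.random.default_rng(0)
fft = rng.random((time_bins, window_size))
mel = rng.random((window_size, filterbanks_count))   # MEL as assumed above

# (MEL^t @ FFT^t)^t == FFT @ MEL, so multiplying in the FFT @ MEL order
# yields the spectrogram directly in (time_bins x filterbanks_count) layout.
direct = fft @ mel
via_transpose = (mel.T @ fft.T).T
```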

Illustration of the reasoning from Apple Documentation here

(Note that the Apple Developer code uses windowSize as the parameter for the number of rows of the B matrix (filterBank in the code); therefore the filterBank matrix is just the filter bank (transposed) represented in the above image.)

This produces the following code:

cblas_sgemm(CblasRowMajor,  //ORDER
            CblasNoTrans,   //Transpose A? If so, op(A) = A^t, else op(A) = A
            CblasNoTrans,   //Transpose B? If so, op(B) = B^t, else op(B) = B
            Int32(fftResult.count/self.windowSize), //A's and C's rows
            Int32(self.filterbanksCount),   //B's and C's cols
            Int32(self.windowSize), //A's cols, B's rows
            1,  //Scale for A and B's product
            fftResultPtr.baseAddress,   //A
            Int32(self.windowSize), //rows of op(A)^t
            filterBank.baseAddress, //B
            Int32(self.filterbanksCount), //rows of op(B)^t
            0,  //Result scale
            sgemmResult!.baseAddress,   //C
            Int32(self.filterbanksCount)    //rows of C^t
    )

So, if I use the above code in place of the existing one, I get a spectrogram where the lower frequencies take up about 20% of the overall height while the higher frequencies take the remaining 80%, which means the hypothetical effect of applying the Mel scale to the linear spectrogram is flipped.

Where did my reasoning fail, and how should I go about fixing my code?


Solution

  • Note that, according to the image in the post, the STFT result, which is (time_bins x window_size), must not be transposed. This observation eventually led me to the correct solution, which is the following:

    cblas_sgemm(CblasRowMajor,  //ORDER
                CblasNoTrans,   //Transpose A?
                CblasTrans,     //Transpose B?
                Int32(fftResult.count/self.windowSize), //A's and C's rows
                Int32(self.filterbanksCount),   //B's and C's cols (after the transpose)
                Int32(self.windowSize), //A's cols, B's rows (after the transpose)
                1,  //Scale for A and B's product
                fftResultPtr.baseAddress,   //A
                Int32(self.windowSize), //leading dimension of A: its columns as stored (row-major)
                filterBank.baseAddress, //B
                Int32(self.windowSize), //leading dimension of B: its columns as stored (row-major)
                0,  //Result scale
                sgemmResult!.baseAddress,   //C
                Int32(self.filterbanksCount)    //leading dimension of C: its columns as stored (row-major)
        )
    

    where fftResult.count/self.windowSize gives the number of time bins in the STFT, since the STFT matrix is what's called fftResult in the code above.
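In matrix terms, CblasNoTrans on A plus CblasTrans on B corresponds to FFT·filterBankᵗ, with filterBank stored as the example builds it, one row per Mel filter. A quick NumPy check of the resulting layout (toy sizes, for illustration only):

```python
import numpy as np

time_bins, window_size, filterbanks_count = 5, 8, 3
rng = np.random.default_rng(1)
fft_result = rng.random((time_bins, window_size))
# filterBank as the example builds it: one row per Mel filter.
filter_bank = rng.random((filterbanks_count, window_size))

# FFT @ filterBank^t gives the Mel spectrogram directly in
# (time_bins x filterbanks_count) layout, one row per time bin.
mel_spectrogram = fft_result @ filter_bank.T
```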