audio fft speech-recognition kaldi

Understanding audio file spectrogram values


I am currently struggling to understand how the power spectrum is stored in the Kaldi framework.

I seem to have successfully created some data files using

$cmd JOB=1:$nj $logdir/spect_${name}.JOB.log \
    compute-spectrogram-feats --verbose=2 \
     scp,p:$logdir/wav_spect_${name}.JOB.scp ark:- \| \
    copy-feats --compress=$compress $write_num_frames_opt ark:- \
      ark,scp:$specto_dir/raw_spectogram_$name.JOB.ark,$specto_dir/raw_spectogram_$name.JOB.scp

This gives me a large file with data points for the different audio files, like this.

The problem is that I am not sure how I should interpret this data set. I know that an FFT is performed before this, which I guess is a good thing.

The output example given above is from a file which is 1 second long.
All the default settings were used for computing the spectrogram, so the sample frequency should be 16 kHz, frame length = 25 ms and overlap = 10 ms. The number of data points in the first set is 25186.

Given this information, can I interpret the output in some way?

Usually when one performs an FFT, the frequency bin size is given by F_s/N = bin_size, where F_s is the sample frequency and N is the FFT length. Is that also the case here? 16000/25186 = 0.6... Hz/bin?

Or am I interpreting it incorrectly?


Solution

  • Usually when one performs an FFT, the frequency bin size is given by F_s/N = bin_size, where F_s is the sample frequency and N is the FFT length.

    Is that also the case here? 16000/25186 = 0.6... Hz/bin?

    The formula F_s/N is indeed what you would use to compute the frequency bin size. However, as you mention, N is the FFT length, not the total number of values in the output. Your generated output data file has 98 lines of 257 values each (98 × 257 = 25186), that is, 98 frames of 257 frequency bins. For real-valued input, an FFT of length N yields N/2 + 1 distinct bins, so 257 bins point to an FFT length of 512. That is also the smallest power of two that fits a 25 ms frame (400 samples at 16 kHz), and 98 frames is what a 10 ms hop size produces over a 1 second file. This gives you a frequency bin size of 16000/512 = 31.25 Hz/bin.
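    The arithmetic behind these numbers can be checked with a short shell sketch. It assumes the values quoted in the question (16 kHz, 25 ms frames, 10 ms hop, 1 second of audio), that the frame is zero-padded up to the next power of two, and that only frames fitting entirely inside the signal are kept. The variable names are illustrative and are not Kaldi options.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the frame/bin counts from the question's settings.
# All values below are assumptions taken from the question, not read from Kaldi.
sample_rate=16000      # Hz
frame_length_ms=25     # analysis window length
frame_shift_ms=10      # hop size between consecutive frames
duration_ms=1000       # 1 second of audio

num_samples=$(( sample_rate * duration_ms / 1000 ))      # 16000 samples
frame_len=$(( sample_rate * frame_length_ms / 1000 ))    # 400 samples per frame
frame_shift=$(( sample_rate * frame_shift_ms / 1000 ))   # 160 samples per hop

# Assumed: the FFT length is the smallest power of two >= the frame length.
fft_len=1
while (( fft_len < frame_len )); do
    fft_len=$(( fft_len * 2 ))
done

# A real-valued FFT of length N has N/2 + 1 distinct frequency bins.
num_bins=$(( fft_len / 2 + 1 ))

# Assumed: only frames that fit entirely inside the signal are counted.
num_frames=$(( 1 + (num_samples - frame_len) / frame_shift ))

total_values=$(( num_frames * num_bins ))

# Bin width in Hz (awk is used because bash arithmetic is integer-only).
bin_hz=$(LC_ALL=C awk -v fs="$sample_rate" -v n="$fft_len" 'BEGIN { printf "%.2f", fs / n }')

echo "fft_len=$fft_len frames=$num_frames bins=$num_bins values=$total_values bin_hz=$bin_hz"
```

    Under these assumptions the script reports 98 frames of 257 bins, 25186 values in total, and a bin width of 31.25 Hz, which matches the size of the data set in the question.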

    Based on this scaling, plotting your raw data with the following Matlab script (with the data previously loaded in the Z matrix):

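    fs       = 16000; % 16 kHz sampling rate
    hop_size = 0.010; % 10 ms hop size, in seconds
    nfft     = 512;   % FFT length inferred from the 257 bins per frame
    % Z is num_frames x num_bins: rows are time frames, columns are frequency bins
    [X,Y] = meshgrid((0:size(Z,1)-1)*hop_size, (0:size(Z,2)-1)*fs/nfft);
    surf(X,Y,transpose(Z),'EdgeColor','none','FaceColor','interp');
    view(2); % view from above so the surface reads as a 2-D spectrogram
    xlabel('Time (seconds)');
    ylabel('Frequency (Hz)');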
    

    gives this graph (the dark red regions are the areas of highest intensity):

    [Figure: spectrogram]
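    The Matlab script assumes that Z is already loaded. One possible way to get there, sketched below, is to dump the archive in Kaldi's text format and strip the wrapper around the matrix. The helper name `kaldi_text_to_rows` and the file names in the comments are hypothetical. The sketch also assumes the text dump has the usual layout of an `utt_id  [` header line followed by one row per frame, with a closing `]` on the last row.

```shell
#!/usr/bin/env bash
# Sketch: turn a Kaldi text-format matrix into plain whitespace-separated rows.
# Assumed input layout (one utterance):
#   utt_id  [
#     v v v ...
#     v v v ... ]
kaldi_text_to_rows() {
    awk '
        /\[ *$/ { next }            # skip the "utt_id  [" header line
        {
            gsub(/\]/, "")          # drop the closing bracket on the last row
            if (NF > 0) {
                $1 = $1             # rebuild the record with single spaces
                print
            }
        }
    '
}

# Hypothetical usage (file names are placeholders, not from the question):
#   copy-feats ark:raw_spectogram_NAME.1.ark ark,t:- | kaldi_text_to_rows > Z.txt
# If the archive holds several utterances, their rows end up concatenated,
# so select a single utterance first.
```

    The resulting file holds one frame per line and can then be read as a numeric matrix, for example with `Z = load('Z.txt');` in Matlab.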