I generated a spectrogram of a "seven" utterance using the "egs/tidigits" code from Kaldi, with 23 bins, a 20 kHz sampling rate, a 25 ms window, and a 10 ms shift. The spectrogram, visualized with the MATLAB imagesc function, appears as below:
I am experimenting with using Librosa as an alternative to Kaldi. I set up my code as below using the same number of bins, sampling rate, and window length / shift as above.
import librosa
import numpy as np
# At 20 kHz, a 25 ms window is 500 samples and a 10 ms shift is 200 samples
time_series, sample_rate = librosa.core.load("7a.wav", sr=20000)
spectrogram = librosa.feature.melspectrogram(time_series, sr=20000, n_mels=23, n_fft=500, hop_length=200)
log_S = librosa.core.logamplitude(spectrogram)
np.savetxt("7a.txt", log_S.T)
However, when I visualize the resulting Librosa spectrogram of the same WAV file, it looks different:
Can someone please help me understand why these look so different? Across the other WAV files I've tried, I notice that with my Librosa script above, my fricatives (like the /s/ in "seven" in the above example) are being cut off, and this is greatly affecting my digit classification accuracy. Thank you!
Kaldi applies a lifter to the DCT output by default, which is why the upper coefficients are attenuated. See details here.
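To make the idea of a lifter concrete, here is a minimal sketch in plain Python. It assumes the HTK-style sinusoidal form, w[i] = 1 + (Q/2) * sin(pi * i / Q), with Q = 22. The names `lifter_weights` and `apply_lifter` are made up for this example and are not Kaldi or Librosa APIs, so check the formula and the default value against the Kaldi version you are running.

```python
import math


def lifter_weights(n_ceps, q=22.0):
    """Per-coefficient scaling weights for a sinusoidal cepstral lifter.

    Assumed form: w[i] = 1 + (q / 2) * sin(pi * i / q).
    Index 0 (C0) always gets weight 1, so it is left unchanged.
    """
    return [1.0 + 0.5 * q * math.sin(math.pi * i / q) for i in range(n_ceps)]


def apply_lifter(frames, q=22.0):
    """Scale every frame of cepstral coefficients (the DCT output) by the weights.

    `frames` is a list of frames, each a list of coefficients of equal length.
    """
    if not frames:
        return []
    weights = lifter_weights(len(frames[0]), q)
    return [[c * w for c, w in zip(frame, weights)] for frame in frames]


# Hypothetical usage: two frames of 13 cepstral coefficients each.
example_frames = [[1.0] * 13, [0.5] * 13]
liftered = apply_lifter(example_frames, q=22.0)
```

The point of the sketch is only that each coefficient is rescaled by a weight that depends on its index, so features computed with and without this step will not match value for value. Applying the same scaling to both feature sets, or to neither, gives a fairer comparison.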