For context: I'm trying to create a simple "level monitor" animation of audio data streaming from a microphone. I'm running this code on an iOS device and leaning heavily on the Accelerate framework for data processing.
A lot of what I have so far is heavily influenced by this example project from Apple: https://developer.apple.com/documentation/accelerate/visualizing_sound_as_an_audio_spectrogram
Here are the current steps I'm taking:
Honestly, after step 5, I just have no intuitive understanding of what is going on with the frequency-domain values. I get that a higher value means the frequency represented by that bin is more prevalent in the time-domain data... but I don't know what a value of, say, 12 versus 6492 actually means.
Anyway, the end result is that the lowest band of bins (0...255) has a power that is basically just the overall amplitude, while the three higher bands never rise above 0.001. I feel like I'm on the right track, but my ignorance of what the DCT output means is preventing me from figuring out what is going wrong. I could also use an FFT if that would produce a better result, but I'm given to understand that the FFT and DCT produce analogous results, and Apple recommends the DCT for performance.
The DFT/DCT is linear in its input, so when the inputs are amplitudes (which is the case for a standard audio file or microphone input), the outputs are amplitudes too.
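You can verify the linearity directly. Here is a minimal sketch using `vDSP.DCT`; the 1024-sample length, 440 Hz tone, and 44.1 kHz rate are all just illustrative assumptions:

```swift
import Accelerate
import Foundation

// Illustrative 440 Hz tone; the length and sample rate are assumptions.
let n = 1024
let signal = (0..<n).map { sin(2 * Float.pi * 440 * Float($0) / 44_100) }

// vDSP.DCT supports lengths of the form f * 2^p (f in {1, 3, 5, 15}, p >= 4).
guard let dct = vDSP.DCT(count: n, transformType: .II) else {
    fatalError("unsupported DCT length")
}

var spectrum = [Float](repeating: 0, count: n)
dct.transform(signal, result: &spectrum)

// Linearity: doubling the input amplitude doubles every output value.
var doubledSpectrum = [Float](repeating: 0, count: n)
dct.transform(vDSP.multiply(2, signal), result: &doubledSpectrum)
// doubledSpectrum[k] ≈ 2 * spectrum[k] for every k (within float precision).
```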
It seems this will be used for visualization. In that case, I recommend converting the amplitudes to decibels: it makes the range of values much more compact, which is desirable when showing them on finite screen real estate, and it is also quite conventional.
For an amplitude, that is 20 * log10(amp / ref), where ref might be just 1.0 if you are going to normalize afterward anyway. Note that normalization in the decibel domain is an additive shift, not a division.
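Accelerate can do this conversion in one vectorized call. A sketch, where `dctOutput` is a stand-in for your frequency-domain values:

```swift
import Accelerate

// Stand-in values; in practice this is your DCT output buffer.
let dctOutput: [Float] = [0.001, 12, -340, 6492]

// The dB conversion needs non-negative values, so take magnitudes first.
let magnitudes = vDSP.absolute(dctOutput)

// decibels[i] = 20 * log10(magnitudes[i] / zeroReference)
var decibels = [Float](repeating: 0, count: magnitudes.count)
vDSP.convert(amplitude: magnitudes,
             toDecibels: &decibels,
             zeroReference: 1.0)

// Normalizing in the dB domain: shift so the peak sits at 0 dB.
let normalized = vDSP.add(-vDSP.maximum(decibels), decibels)
```

This also answers the "12 vs 6492" question in a practical sense: on the dB scale that roughly 540x ratio becomes a difference of about 55 dB, which is far easier to map onto pixels.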
The frequency of DCT bin k is k / (2N) * fs, where k is the bin index, N is the length of the transform, and fs is the sample rate.
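For concreteness, here is that formula evaluated for every bin, assuming a 1024-point transform and a 44.1 kHz sample rate (adjust both to your setup):

```swift
// Assuming a 1024-point transform and a 44.1 kHz sample rate.
let n = 1024
let sampleRate: Float = 44_100

// Center frequency of bin k: k / (2N) * fs.
let binFrequencies = (0..<n).map { Float($0) / Float(2 * n) * sampleRate }
// binFrequencies[0] == 0 Hz, binFrequencies[1] ≈ 21.5 Hz,
// and binFrequencies[1023] ≈ 22_028 Hz (just below fs / 2).
```

So if you are grouping 1024 bins into four bands of 256, the lowest band covers roughly 0 to 5.5 kHz, which is where nearly all the energy of voice and most music lives. That alone can explain why your upper three bands stay near zero.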