Tags: ios, audio, fft, audio-processing, dct

Using DCT to create real-time "levels" animation for microphone input


For context: I'm trying to create a simple "level monitor" animation of audio data streaming from a microphone. I'm running this code on an iOS device and leaning heavily on the Accelerate framework for data processing.

A lot of what I have so far is heavily influenced by this example project from Apple: https://developer.apple.com/documentation/accelerate/visualizing_sound_as_an_audio_spectrogram

Here are the current steps I'm taking:

  1. Start receiving (Int16) samples from the microphone using AVFoundation.
  2. Store samples until I have at least 1024, then send the first 1024 samples to my processing algorithm.
  3. Convert samples to denormalized Float (single-precision floating point).
  4. Apply a Hann (Hanning) window to the samples to reduce spectral leakage, since the number of samples is fairly low for performance reasons.
  5. Run a Forward DCT-II transformation of the time-domain samples into frequency-domain samples.
  6. Absolute value on all samples.
  7. "Bin" the samples to match the number of bars I have to animate: for each group of 1024/n samples, take the maximum value in that range.
  8. Normalize each of the bins into the 0...1 range by dividing each by the highest magnitude sample that has been encountered, globally.
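For reference, here is a minimal pure-Swift sketch of steps 4–8 (no Accelerate, and a naive O(N²) DCT-II, just to make the math concrete; the function names and the `binCount`/`globalMax` parameters are mine, not from the Apple sample):

```swift
import Foundation

/// Naive DCT-II, the transform vDSP's forward DCT computes (up to a scale factor):
/// X[k] = Σ_{n=0..N-1} x[n] * cos(π/N * (n + 0.5) * k)
func dctII(_ x: [Float]) -> [Float] {
    let n = x.count
    return (0..<n).map { k in
        var sum: Float = 0
        for i in 0..<n {
            sum += x[i] * Float(cos(Double.pi / Double(n) * (Double(i) + 0.5) * Double(k)))
        }
        return sum
    }
}

/// Steps 4–8: window, transform, magnitude, bin, normalize.
func levels(samples: [Float], binCount: Int, globalMax: inout Float) -> [Float] {
    let n = samples.count
    // Step 4: Hann window.
    let windowed = samples.enumerated().map { i, s in
        s * Float(0.5 * (1 - cos(2 * Double.pi * Double(i) / Double(n - 1))))
    }
    // Steps 5–6: DCT-II, then magnitudes.
    let magnitudes = dctII(windowed).map { abs($0) }
    // Step 7: maximum within each of binCount equal-width ranges.
    let width = n / binCount
    let bins = (0..<binCount).map { b in
        magnitudes[b * width ..< (b + 1) * width].max() ?? 0
    }
    // Step 8: normalize by the largest magnitude seen so far, globally.
    globalMax = max(globalMax, bins.max() ?? 0)
    return globalMax > 0 ? bins.map { $0 / globalMax } : bins
}
```

Running this on a constant (DC) input shows the behavior you describe: essentially all the energy lands in bin 0, because the windowed DC signal only excites the lowest few DCT coefficients.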

Honestly, after step 5, I just have no intuitive understanding of what is going on with the frequency-domain values. I get that a higher value means the frequency represented by a single value is more prevalent in the time-domain data... but I don't know what a value of, say, 12 vs. 6492 means.

Anyway, the end result is that the lowest bin (0...255) has a power that is basically just the overall amplitude, while the higher 3 bins never rise above 0.001. I feel like I'm on the right track, but that my ignorance of what the DCT output means is preventing me from figuring out what is going wrong here. I could also use FFT, if that would produce a better result, but I'm given to understand that FFT and DCT produce analogous results and Apple recommends DCT for performance.


Solution

  • The DFT/DCT is linear in its input. So when the inputs are amplitudes (which is the case for a standard audio file or microphone input), so are the outputs.

    It seems this will be used for visualization. In that case, I recommend converting the amplitudes into decibels. It will make the range of values much more compact, which is desirable when showing on finite screen real estate, and it is also quite conventional. For an amplitude amp, that is 20*log10(amp/ref), where ref can simply be 1.0 if you are going to normalize afterward anyway (your step 8). Note that normalization in the decibel domain is an additive shift, not a division.
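A small sketch of that conversion in plain Swift (the function names, the `ref` default, and the -96 dB silence floor are my choices, not anything prescribed by Accelerate):

```swift
import Foundation

/// Convert linear amplitudes to decibels: 20 * log10(amp / ref).
/// `floorDB` clamps silence (amp == 0) to a finite value instead of -infinity.
func amplitudesToDecibels(_ amps: [Float], ref: Float = 1.0, floorDB: Float = -96) -> [Float] {
    amps.map { amp in
        amp > 0 ? max(20 * Float(log10(Double(amp) / Double(ref))), floorDB) : floorDB
    }
}

/// In the dB domain, "normalizing to the maximum" is a subtraction, not a
/// division: the loudest value becomes 0 dB and everything else goes negative.
func normalizeDB(_ db: [Float]) -> [Float] {
    guard let peak = db.max() else { return db }
    return db.map { $0 - peak }
}
```

For example, amplitudes of 1.0, 0.1, and 0.01 map to 0, -20, and -40 dB, which compresses a 100:1 range into something a bar animation can show meaningfully.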

    The frequency of DCT bin k is k/(2N) * fs, where k is the bin index, N is the length of the transform, and fs is the sample rate.
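That formula is worth evaluating for your numbers, because it explains why the lowest quarter of the bins dominates (a hedged sketch; the function name is mine):

```swift
/// Center frequency of DCT bin k for an N-point transform at sample rate fs:
/// f_k = k / (2N) * fs
func binFrequency(k: Int, n: Int, sampleRate: Double) -> Double {
    Double(k) / (2 * Double(n)) * sampleRate
}
```

With N = 1024 at fs = 44100 Hz, each bin is about 21.5 Hz wide, and bins 0...255 together span 0 to about 5.5 kHz. That is where the vast majority of speech and music energy lives, so grouping 1024 bins into 4 equal-width bars will naturally light up only the first bar. Spacing your bars logarithmically in frequency (or at least using much narrower low-frequency bins) will distribute the energy far more evenly.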