javascript, audio-streaming, web-audio-api, audiocontext, audio-worklet

AudioWorkletProcessor playing streamed audio sounds scrambled


I'm trying to play audio in the browser that's streamed from my backend, to simulate a real-time voice call, in a way that's compatible with all the relevant browsers, devices, and OSs.

The audio format is MP3 with a 44.1 kHz sample rate at 192 kbps.

app.js

import { Buffer } from 'buffer';

const audioContext = new AudioContext({ sampleRate: 44100 });

await audioContext.audioWorklet.addModule('call-processor.js');

const audioWorkletNode = new AudioWorkletNode(audioContext, 'call-processor');

audioWorkletNode.connect(audioContext.destination);

// add streamed audio chunk to the audioworklet
const addAudioChunk = async (base64) => {
  const buffer = Buffer.from(base64, 'base64');

  try {
    const audioBuffer = await audioContext.decodeAudioData(buffer.buffer);
    const channelData = audioBuffer.getChannelData(0); // Assuming mono audio

    audioWorkletNode.port.postMessage(channelData);
  } catch (e) {
    console.error(e);
  }
};

call-processor.js

class CallProcessor extends AudioWorkletProcessor {
  buffer = new Float32Array(0);

  constructor() {
    super();
    
    this.port.onmessage = this.handleMessage.bind(this);
  }

  handleMessage(event) {
    const chunk = event.data;

    // Append the incoming chunk to the internal buffer.
    const newBuffer = new Float32Array(this.buffer.length + chunk.length);

    newBuffer.set(this.buffer);
    newBuffer.set(chunk, this.buffer.length);

    this.buffer = newBuffer;
  }

  process(inputs, outputs) {
    const output = outputs[0];
    const channel = output[0];
    const requiredSize = channel.length;

    if (this.buffer.length < requiredSize) {
      // Not enough data, zero-fill the output
      channel.fill(0);
    } else {
      // Process the audio
      channel.set(this.buffer.subarray(0, requiredSize));
      // Remove processed data from the buffer
      this.buffer = this.buffer.subarray(requiredSize);
    }

    return true;
  }
}

registerProcessor('call-processor', CallProcessor);

I'm testing in the Chrome browser.

On PC, the first chunk in each stream response sounds perfect, while on iPhone it sounds weird and robotic.

In both cases, subsequent chunks sound a bit scrambled.

Could you please help me understand what I'm doing wrong?

I'll note that I'm not set on AudioWorklet; the goal is to have a seamless audio stream that's compatible with all the relevant browsers, devices, and OSs.

I've also tried two approaches with AudioBufferSourceNode, using standardized-audio-context (https://github.com/chrisguttandin/standardized-audio-context) for broad compatibility:

a. Recursively waiting for the current audio buffer source to finish before playing the next one in the sequence (a simplified sketch follows the code below).

b. Starting each audio buffer source as soon as its chunk is received, with the starting point being the sum of the durations of the preceding audio buffers.

Neither approach is seamless; both result in audible jumps between the chunks, which led me to switch to the AudioWorklet strategy.

app.js

import { Buffer } from 'buffer';
import { AudioContext } from 'standardized-audio-context';

const audioContext = new AudioContext();

let nextPlayTime = 0;

const addAudioChunk = async (base64) => {
  const uint8array = Buffer.from(base64, 'base64');
  const audioBuffer = await audioContext.decodeAudioData(uint8array.buffer);
  const source = audioContext.createBufferSource();

  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

  source.start(nextPlayTime);

  nextPlayTime += audioBuffer.duration;
};
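
Approach (a) looked roughly like this (a simplified sketch; it queues the decoded buffers and uses the source's onended event to play the next one):

const queue = [];
let isPlaying = false;

const playNext = () => {
  if (queue.length === 0) {
    isPlaying = false;
    return;
  }

  isPlaying = true;

  const source = audioContext.createBufferSource();

  source.buffer = queue.shift();
  source.connect(audioContext.destination);
  // Only start the next buffer once the current one has finished.
  source.onended = playNext;
  source.start();
};

const addAudioChunk = async (base64) => {
  const uint8array = Buffer.from(base64, 'base64');
  const audioBuffer = await audioContext.decodeAudioData(uint8array.buffer);

  queue.push(audioBuffer);

  if (!isPlaying) {
    playNext();
  }
};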

Solution

  • Each MP3 frame starts with 12 bits all set to one. If you cut the MP3 at exactly those boundaries it should be possible to decode the parts with decodeAudioData() since each part represents a valid file on its own.

    The problem is that any of these frames may contain data that already belongs to the next frame. When cutting an MP3 at the frame boundary that information is lost because decodeAudioData() does not keep track of any previous invocations. You would need to keep a small overlap when decoding the parts to decode the frames at the boundary twice. The audio data produced by these duplicate frames then needs to be removed from the AudioBuffers produced by decodeAudioData() when stitching them together.

    Imagine you have a variable called arrayBuffer which holds the content of an MP3. You could then collect all of its frames like this:

    const uint8Array = new Uint8Array(arrayBuffer);
    const frames = [];
    
    for (let i = 0; i < uint8Array.length - 1; i += 1) {
      if (uint8Array[i] === 0xff && (uint8Array[i + 1] & 0xf0) === 0xf0) {
        frames.push(i);
      }
    }
    

    You could then loop through the frames to decode them in groups with an overlap.

    const offlineAudioContext = new OfflineAudioContext({
      length: 1,
      // This should be the sampleRate of the MP3.
      sampleRate: 44100
    });
    
    const cache = [0, 0];
    const channelDatas = [];
    const framesPerInterval = 50;
    
    let begin = 0;
    let end = 0;
    
    for (let i = 0; i < frames.length - framesPerInterval; i += framesPerInterval) {
      const audioBuffer = await offlineAudioContext.decodeAudioData(
        arrayBuffer.slice(
          frames[Math.max(0, i - framesPerInterval)],
          frames[i + framesPerInterval]
        )
      );
    
      const intervalLength = audioBuffer.length - cache[1];
    
      begin = end - cache[0];
      end = cache[1] + Math.round(intervalLength / 2);
    
      // This needs to be done for every channel.
      channelDatas.push(audioBuffer.getChannelData(0).slice(begin, end));
    
      cache[0] = cache[1];
      cache[1] = intervalLength;
    }
    

    In the end channelDatas should contain the samples without any gaps.
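
    For illustration, here is a small sketch of stitching the collected samples into one Float32Array, which could then be posted to the AudioWorkletNode from the question (mono audio assumed):

    const totalLength = channelDatas.reduce(
      (total, channelData) => total + channelData.length,
      0
    );
    const samples = new Float32Array(totalLength);

    let writeOffset = 0;

    for (const channelData of channelDatas) {
      samples.set(channelData, writeOffset);
      writeOffset += channelData.length;
    }

    audioWorkletNode.port.postMessage(samples);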

    But you could also use an AudioDecoder to decode the audio. It's a stateful API to decode audio files in chunks. For now it's only available in Chrome but implementations in Firefox and Safari are already being worked on.
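
    A rough sketch of how that could look (the 'mp3' codec string, the planar float output format, and passing each streamed chunk as a single EncodedAudioChunk are assumptions that need to be verified against the actual stream):

    const decoder = new AudioDecoder({
      error: (err) => console.error(err),
      output: (audioData) => {
        // Copy the first channel's samples out of the decoded AudioData.
        const channelData = new Float32Array(audioData.numberOfFrames);

        audioData.copyTo(channelData, { planeIndex: 0 });
        audioData.close();

        // The samples could then be posted to the AudioWorkletNode from the question.
        audioWorkletNode.port.postMessage(channelData);
      }
    });

    decoder.configure({ codec: 'mp3', sampleRate: 44100, numberOfChannels: 1 });

    // Unlike decodeAudioData() the decoder keeps its state between calls, so
    // frames which reference data from previous chunks still decode correctly.
    const addAudioChunk = (arrayBuffer) => {
      decoder.decode(new EncodedAudioChunk({
        data: arrayBuffer,
        timestamp: 0, // not meaningful for a live stream like this
        type: 'key'
      }));
    };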