I have a Tauri app that records a user's voice and sends an audio/webm base64-encoded string to a backend Rust function, which processes it with OpenAI's Whisper model. It mostly works, except that the Whisper model always returns "buzzing", "engine revving", or other ambient noises as the transcribed segments. I've tested it with a wav file and it works perfectly. Why is it not working with my method?

Here is the kind of output I get:
[
  {
    "start_timestamp": 0,
    "end_timestamp": 224,
    "text": " (buzzing)"
  },
  {
    "start_timestamp": 224,
    "end_timestamp": 448,
    "text": " (buzzing)"
  }
]
The frontend recording code:

const audioChunks: Blob[] = [];
let mediaRecorder: MediaRecorder;

function App() {
  async function startRecording() {
    audioChunks.length = 0;
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm'
    });
    mediaRecorder.ondataavailable = (e) => {
      audioChunks.push(e.data);
    };
    mediaRecorder.start();
    setIsRecording(true);
  }

  async function stopRecording() {
    mediaRecorder.stop();
    setIsRecording(false);
    mediaRecorder.onstop = async () => {
      const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });
      const base64 = await blobToBase64(audioBlob);
      console.log({ base64 });
      const result = await invoke('process_audio', { audioData: base64.split(',')[1] });
    };
  }
}
The Tauri command that processes the audio:

use base64::{engine::general_purpose, Engine as _};
use tauri::State;
use whisper_rs::{FullParams, SamplingStrategy};

fn parse_base64(base64_str: &str) -> Vec<i16> {
    // Decode the base64 content
    let decoded_bytes = general_purpose::STANDARD
        .decode(base64_str)
        .expect("Failed to decode base64 content");
    // Convert the decoded bytes to i16 samples
    let samples: Vec<i16> = decoded_bytes
        .chunks_exact(2)
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
    samples
}

#[tauri::command]
fn process_audio(state: State<AppState>, audio_data: &str) -> Result<String, String> {
    let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    let ctx = state.ctx.lock().unwrap();
    let mut model_state = ctx.create_state().expect("failed to create state");

    let original_samples = parse_base64(audio_data);
    let mut samples = vec![0.0f32; original_samples.len()];
    whisper_rs::convert_integer_to_float_audio(&original_samples, &mut samples)
        .expect("failed to convert samples");

    model_state
        .full(params, &samples)
        .map_err(|e| format!("failed to run model: {:?}", e))?;

    let mut audio_segments = Vec::new();
    let num_segments = model_state
        .full_n_segments()
        .map_err(|e| format!("failed to get number of segments: {:?}", e))?;
    for i in 0..num_segments {
        let segment_text = model_state
            .full_get_segment_text(i)
            .map_err(|e| format!("failed to get segment text: {:?}", e))?;
        let start_timestamp = model_state
            .full_get_segment_t0(i)
            .map_err(|e| format!("failed to get segment start timestamp: {:?}", e))?;
        let end_timestamp = model_state
            .full_get_segment_t1(i)
            .map_err(|e| format!("failed to get segment end timestamp: {:?}", e))?;
        let audio_segment = AudioSegment {
            start_timestamp: start_timestamp as f64,
            end_timestamp: end_timestamp as f64,
            text: segment_text,
        };
        audio_segments.push(audio_segment);
    }

    let json_result = serde_json::to_string(&audio_segments)
        .map_err(|e| format!("failed to serialize audio segments to JSON: {:?}", e))?;
    Ok(json_result)
}
It looks like you are using whisper-rs, which exposes bindings to whisper.cpp. Whisper.cpp only accepts 16-bit WAV-encoded samples (raw PCM), contrary to the OpenAI API, which also accepts webm among other formats. The reason nothing sensible is detected is that you supply the audio in an unsupported format: parse_base64 reinterprets the compressed webm container bytes as if they were raw PCM samples, so the model effectively receives noise and labels it "(buzzing)".
If you want to solve the problem, it would probably be best to record the audio as wav in the browser / Tauri webview. One option is to use extendable-media-recorder in combination with extendable-media-recorder-wav-encoder.
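A minimal sketch of that approach, assuming both packages are installed (the helper name startWavRecording is just for illustration): the WAV encoder has to be registered once before a MediaRecorder with mimeType 'audio/wav' is created.

import { MediaRecorder, register } from 'extendable-media-recorder';
import { connect } from 'extendable-media-recorder-wav-encoder';

// Register the WAV encoder once (e.g. at app startup), before any recording starts.
await register(await connect());

// Illustrative helper: records from the microphone as WAV instead of webm.
async function startWavRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/wav' });
  recorder.start();
  return recorder;
}

Keep in mind that whisper.cpp expects 16 kHz mono audio, so depending on the sample rate your recording ends up with, you may still need to downsample/downmix before running the model.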
Alternatively, you could convert the webm data to wav in Rust, for example using symphonia and hound.
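Whichever route you take, once the data that reaches Rust is WAV, it is safer to parse it with hound than to reinterpret the whole buffer (header included) as raw PCM the way parse_base64 currently does. A rough sketch, assuming a hypothetical parse_wav_base64 replacement:

use std::io::Cursor;

use base64::{engine::general_purpose, Engine as _};
use hound::WavReader;

// Hypothetical replacement for parse_base64: decode the base64 payload and
// read it as a WAV file, so the header and sample format are handled properly.
fn parse_wav_base64(base64_str: &str) -> Result<Vec<i16>, String> {
    let bytes = general_purpose::STANDARD
        .decode(base64_str)
        .map_err(|e| format!("failed to decode base64: {e}"))?;
    let reader = WavReader::new(Cursor::new(bytes))
        .map_err(|e| format!("failed to parse WAV: {e}"))?;
    reader
        .into_samples::<i16>()
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| format!("failed to read samples: {e}"))
}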