I have a Tauri app that records a user's voice and sends an audio/webm base64-encoded string to a backend Rust function, which processes it with OpenAI's Whisper model. It mostly works, except that the Whisper model always returns "buzzing", "engine revving", or other ambient noises as the transcribed segments. I've tested it with a wav file and it works perfectly. Why is it not working with my method?

Here is the kind of output I get:
[
  {
    "start_timestamp": 0,
    "end_timestamp": 224,
    "text": " (buzzing)"
  },
  {
    "start_timestamp": 224,
    "end_timestamp": 448,
    "text": " (buzzing)"
  }
]
The frontend recording code:

const audioChunks: Blob[] = [];
let mediaRecorder: MediaRecorder;

function App() {
  async function startRecording() {
    audioChunks.length = 0;
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream, {
      mimeType: 'audio/webm'
    });
    mediaRecorder.ondataavailable = (e) => {
      audioChunks.push(e.data);
    };
    mediaRecorder.start();
    setIsRecording(true);
  }

  async function stopRecording() {
    mediaRecorder.stop();
    setIsRecording(false);
    mediaRecorder.onstop = async () => {
      const audioBlob = new Blob(audioChunks, { type: 'audio/webm' });
      const base64 = await blobToBase64(audioBlob);
      console.log({ base64 });
      const result = await invoke('process_audio', { audioData: base64.split(',')[1] });
    };
  }
}
The Tauri command that processes the audio:

use base64::{engine::general_purpose, Engine as _};
use tauri::State;
use whisper_rs::{FullParams, SamplingStrategy};

fn parse_base64(base64_str: &str) -> Vec<i16> {
    // Decode the base64 content
    let decoded_bytes = general_purpose::STANDARD
        .decode(base64_str)
        .expect("Failed to decode base64 content");
    // Convert the decoded bytes to i16 samples
    let samples: Vec<i16> = decoded_bytes
        .chunks_exact(2)
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
    samples
}

#[tauri::command]
fn process_audio(state: State<AppState>, audio_data: &str) -> Result<String, String> {
    let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    let ctx = state.ctx.lock().unwrap();
    let mut model_state = ctx.create_state().expect("failed to create state");

    let original_samples = parse_base64(audio_data);
    let mut samples = vec![0.0f32; original_samples.len()];
    whisper_rs::convert_integer_to_float_audio(&original_samples, &mut samples)
        .expect("failed to convert samples");

    model_state
        .full(params, &samples)
        .map_err(|e| format!("failed to run model: {:?}", e))?;

    let mut audio_segments = Vec::new();
    let num_segments = model_state
        .full_n_segments()
        .map_err(|e| format!("failed to get number of segments: {:?}", e))?;
    for i in 0..num_segments {
        let segment_text = model_state
            .full_get_segment_text(i)
            .map_err(|e| format!("failed to get segment text: {:?}", e))?;
        let start_timestamp = model_state
            .full_get_segment_t0(i)
            .map_err(|e| format!("failed to get segment start timestamp: {:?}", e))?;
        let end_timestamp = model_state
            .full_get_segment_t1(i)
            .map_err(|e| format!("failed to get segment end timestamp: {:?}", e))?;
        let audio_segment = AudioSegment {
            start_timestamp: start_timestamp as f64,
            end_timestamp: end_timestamp as f64,
            text: segment_text,
        };
        audio_segments.push(audio_segment);
    }

    let json_result = serde_json::to_string(&audio_segments)
        .map_err(|e| format!("failed to serialize audio segments to JSON: {:?}", e))?;
    Ok(json_result)
}
It looks like you are using whisper-rs, which exposes bindings to whisper.cpp. Whisper.cpp only accepts 16-bit WAV-encoded samples (raw PCM), contrary to the OpenAI API, which also accepts webm among other formats. The reason nothing sensible is detected is that you supply the audio in an unsupported format: parse_base64 reinterprets the compressed webm container bytes as if they were raw PCM samples, so the model effectively receives noise and labels it "(buzzing)".
If you want to solve the problem, it would probably be best to record the audio as wav in the browser / Tauri webview. One option is to use extendable-media-recorder in combination with extendable-media-recorder-wav-encoder.
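A minimal sketch of that approach, assuming both packages are installed (the helper name startWavRecording is just for illustration): the WAV encoder has to be registered once before a MediaRecorder with mimeType 'audio/wav' is created.

import { MediaRecorder, register } from 'extendable-media-recorder';
import { connect } from 'extendable-media-recorder-wav-encoder';

// Register the WAV encoder once (e.g. at app startup), before any recording starts.
await register(await connect());

// Illustrative helper: records from the microphone as WAV instead of webm.
async function startWavRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/wav' });
  recorder.start();
  return recorder;
}

Keep in mind that whisper.cpp expects 16 kHz mono audio, so depending on the sample rate your recording ends up with, you may still need to downsample/downmix before running the model.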
Alternatively, you could convert the webm data to wav in Rust, for example using symphonia and hound.
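Whichever route you take, once the data that reaches Rust is WAV, it is safer to parse it with hound than to reinterpret the whole buffer (header included) as raw PCM the way parse_base64 currently does. A rough sketch, assuming a hypothetical parse_wav_base64 replacement:

use std::io::Cursor;

use base64::{engine::general_purpose, Engine as _};
use hound::WavReader;

// Hypothetical replacement for parse_base64: decode the base64 payload and
// read it as a WAV file, so the header and sample format are handled properly.
fn parse_wav_base64(base64_str: &str) -> Result<Vec<i16>, String> {
    let bytes = general_purpose::STANDARD
        .decode(base64_str)
        .map_err(|e| format!("failed to decode base64: {e}"))?;
    let reader = WavReader::new(Cursor::new(bytes))
        .map_err(|e| format!("failed to parse WAV: {e}"))?;
    reader
        .into_samples::<i16>()
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| format!("failed to read samples: {e}"))
}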