I am attempting to use the Python Whisper speech-to-text API from Rust via PyO3.
The example given on the Whisper GitHub page is:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
It prints just the text. This makes sense, since the example only prints the value stored under "text" in the returned dict.
My Rust program is as follows:
use pyo3::prelude::*;
use pyo3::types::PyTuple;

fn main() -> PyResult<()> {
    let arg1 = "tiny";
    let arg2 = "/test_dir/test/Test3.opus";
    Python::with_gil(|py| {
        let whisper = PyModule::import(py, "whisper")?;
        let model_args = PyTuple::new(py, &[arg1]);
        let audio_args = PyTuple::new(py, &[arg2]);
        let model = whisper
            .getattr("load_model")?
            .call1(model_args)?;
            //.extract()?;
        println!("Model loaded");
        let audio = whisper
            .getattr("load_audio")?
            .call1(audio_args)?;
        println!("Audio loaded");
        let result_args = PyTuple::new(py, &[model, audio]);
        println!("Arguments setup");
        let result = whisper
            .getattr("transcribe")?
            .call1(result_args)?;
        println!("Output is: {}", result);
        Ok(())
    })
}
The problem I am encountering is that, while I do get output, it is not just the text but all three parts of the dict returned by transcribe. It prints the text, the segments, and the language, when I want only the text.
Looking at the PyO3 documentation, I can't work out how to get just the text from my output.
It's probably something fundamental I am overlooking, but any suggestion is appreciated, as I am new to Rust!
I would wager that your result is a PyDict. If that's the case, you can probably get the text with the following:
let text = result.get_item("text");