python, timestamp, speech-recognition, openai-api, openai-whisper

How can I get word-level timestamps in OpenAI's Whisper ASR?


I use OpenAI's Whisper Python library for speech recognition. How can I get word-level timestamps?


To transcribe with OpenAI's Whisper (tested on Ubuntu 20.04 x64 LTS with an Nvidia GeForce RTX 3090):

conda create -y --name whisperpy39 python==3.9
conda activate whisperpy39
pip install git+https://github.com/openai/whisper.git
sudo apt update && sudo apt install ffmpeg
whisper recording.wav
whisper recording.wav --model large

If using an Nvidia GeForce RTX 3090, install a CUDA-enabled PyTorch after conda activate whisperpy39:

conda install pytorch==1.10.1 torchvision torchaudio cudatoolkit=11.0 -c pytorch

Solution

  • In openai-whisper version 20231117, you can get word-level timestamps by setting word_timestamps=True when calling transcribe():

    pip install openai-whisper

    import whisper

    model = whisper.load_model("large")
    transcript = model.transcribe("toto.mp3", word_timestamps=True)
    for segment in transcript['segments']:
        print(''.join(f"{word['word']}[{word['start']}/{word['end']}]"
                      for word in segment['words']))


    prints:

    Toto,[2.98/3.4] I[3.4/3.82] have[3.82/3.96] a[3.96/4.02] feeling[4.02/4.22] we're[4.22/4.44] not[4.44/4.56] in[4.56/4.72] Kansas[4.72/5.14] anymore.[5.14/5.48]
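
    The transcript dict can be post-processed further. As a sketch (assuming only the segments → words structure shown above; the helper names and the SRT-style output format are my own, not part of Whisper's API), here is how you might flatten the word timings and render them as HH:MM:SS,mmm timestamps:

    ```python
    def to_srt_time(seconds):
        """Format a time in seconds as an SRT-style HH:MM:SS,mmm timestamp."""
        ms = round(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def word_timings(transcript):
        """Flatten a Whisper-style transcript dict into (word, start, end) tuples."""
        return [
            (word['word'].strip(), word['start'], word['end'])
            for segment in transcript['segments']
            for word in segment['words']
        ]

    # Sample dict mimicking the structure Whisper returns
    # (values taken from the output above):
    transcript = {'segments': [{'words': [
        {'word': ' Toto,', 'start': 2.98, 'end': 3.4},
        {'word': ' I', 'start': 3.4, 'end': 3.82},
    ]}]}

    for text, start, end in word_timings(transcript):
        print(f"{to_srt_time(start)} --> {to_srt_time(end)}  {text}")
        # prints e.g.: 00:00:02,980 --> 00:00:03,400  Toto,
    ```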