openai-whispertranscription

How to improve Whisper speech to text


Although Whisper’s transcription is highly accurate, there is always jargon (GPT) or non-standard spellings that make the transcript flawed (example: “Dave Prior” is a podcast host and transcription will spell his last name as “Pryor.”) What are some ways to improve transcription?


Solution

  • There are three usual ways to improve Whisper transcription service:

    1. Prompt Whisper (up to 244 tokens) with a word list. [[1]]
    2. Post process the transcripts with a GPT that is promoted to revise the transcript and supplied with a word list (up to the GPT’s token limit)[[2]]
    3. If your audio file is > 10 min, Whisper performance worsens as the audio file gets longer in length. Break the audio file into chunks of 5-10 minutes and the. Use options 1 on the chunks and option 3 to polish at the end.
    4. Fine tune the model to better understand your accent and domain by training it on an audio file recorded with a word list. [[3]]

    I suggest the above order is in increasing difficulty. If Whisper is having trouble with your accent or how you say acronyms, then fine tuning will be the best solution. The first two options are nice as one could build the prompts dynamically.

    With long recordings (I’ve used up to 40 minutes long so far), I’ve successfully got option 3 to transcribe at 100% accuracy, getting company names correct, people’s names correct, and acronyms correct.