
Google Cloud Speech-to-Text not giving output for OGG & MP3 files


I am trying to run speech-to-text on a batch of audio files that are each over 10 minutes long. I don't want to waste storage in the cloud bucket by uploading raw WAV files, so I am using ffmpeg to convert the files to either OGG or MP3, like this:

ffmpeg -y -i audio.wav -ar 12000 -r 16000 audio.mp3

ffmpeg -y -i audio.wav -ar 12000 -r 16000 audio.ogg

For testing purposes I ran the speech-to-text service on a dummy WAV file and it worked: I got the text as expected. But for some reason it doesn't detect any speech when I use the OGG or MP3 file. I could not get AMR files to work either.

My code:

from google.cloud import speech  # provided by the google-cloud-speech package


def transcribe_gcs(gcs_uri):
    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding="OGG_OPUS", #replace with "LINEAR16" for wav, "OGG_OPUS" for ogg, "AMR" for amr
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    print("starting operation")
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result()
    print(response)

I have set up the authentication properly, so that is not a problem.

When I run the speech-to-text service on the same audio in OGG or MP3 format (for MP3 I just comment out the encoding setting in the config), it returns an empty response: it prints a blank line and finishes.

What can I do to fix this?


Solution

  • Use Opus or FLAC

    FLAC

    FLAC is compressed but lossless, so the recognizer receives the same audio as the original WAV. This should give the best speech-to-text results.

    ffmpeg -i input.wav -vn output.flac
    

    Opus

    If file size matters most, use Opus in an OGG container. It produces small files with excellent quality.

    ffmpeg -i input.wav -vn -c:a libopus output.ogg
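
As a rough sketch of how the conversion and the recognition config could be kept in step, the helpers below build the ffmpeg arguments and the matching config fields from the file extension. The names `ffmpeg_command` and `config_kwargs` are hypothetical, and the `-ar 16000` added for Opus is an assumption on top of the commands above, chosen so the file's sample rate matches `sample_rate_hertz` (in the question's commands, `-ar 12000` sets a 12 kHz audio rate while the config claims 16000, and `-r` is ffmpeg's frame rate option, not an audio one). For FLAC the sample rate is left out because Speech-to-Text can read it from the file header. `transcribe_gcs` assumes the `google-cloud-speech` package is installed and credentials are configured.

```python
from pathlib import PurePosixPath

# Assumed target rate for Opus. Speech-to-Text requires OGG_OPUS audio to be
# 8000, 12000, 16000, 24000 or 48000 Hz, and the config must state the same value.
OPUS_SAMPLE_RATE_HZ = 16000


def ffmpeg_command(src, dst):
    """Build the ffmpeg argument list for converting src to FLAC or Ogg Opus."""
    cmd = ["ffmpeg", "-y", "-i", src, "-vn"]
    if PurePosixPath(dst).suffix.lower() == ".ogg":
        # Pin the audio sample rate so it matches the recognition config.
        cmd += ["-ar", str(OPUS_SAMPLE_RATE_HZ), "-c:a", "libopus"]
    cmd.append(dst)
    return cmd


def config_kwargs(gcs_uri, language_code="en-US"):
    """Choose RecognitionConfig fields from the file extension (hypothetical helper)."""
    suffix = PurePosixPath(gcs_uri).suffix.lower()
    if suffix == ".flac":
        # Sample rate is optional for FLAC: it is taken from the file header.
        return {"encoding": "FLAC", "language_code": language_code}
    if suffix == ".ogg":
        return {
            "encoding": "OGG_OPUS",
            "sample_rate_hertz": OPUS_SAMPLE_RATE_HZ,
            "language_code": language_code,
        }
    raise ValueError(f"unsupported audio extension: {suffix!r}")


def transcribe_gcs(gcs_uri):
    """Run long-running recognition on a file in Cloud Storage and return transcripts."""
    from google.cloud import speech  # imported here so the helpers work without it

    kwargs = config_kwargs(gcs_uri)
    # Translate the encoding name into the library's enum member.
    kwargs["encoding"] = getattr(
        speech.RecognitionConfig.AudioEncoding, kwargs["encoding"]
    )
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(**kwargs)
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result()
    return [r.alternatives[0].transcript for r in response.results if r.alternatives]
```

The argument list from `ffmpeg_command` can be passed to `subprocess.run(..., check=True)`. After uploading the converted file to the bucket, calling `transcribe_gcs` with its `gs://` URI should return one transcript string per recognized segment.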